Presence change storm putting increased load on server #16843

FlyveHest · 2024-01-23T17:14:47Z

Description

Six hours ago, my private Synapse server suddenly started putting 5 times the load on my server than usual.

I have narrowed it down to what seems about a handful of clients, based on IP, that GET and OPTION around 150 presence changes every second.

Users are running either Element Web latest version, or official clients on phones or desktop.

Searching issues, this might seem to somewhat related : #16705

Steps to reproduce

I don't know, the flood started 6 hours ago with no change in neither server nor clients (from what I can tell) and have been running since.

Restarting the server did nothing, the presence changes continued as soon as I restarted it.

Also tried waiting a minute before starting, same issue.

Homeserver

matrix.gladblad.dk

Synapse Version

1.99.0

Installation Method

Docker (matrixdotorg/synapse)

Database

SQLite

Workers

Single process

Platform

Ubuntu 20, in docker

Configuration

No response

Relevant log output

x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=2&timeout=30000&set_presence=unavailable&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 916 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 OPR/106.0.0.0"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 916 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844285_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=online&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 917 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.55 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=2&timeout=30000&set_presence=unavailable&since=s322387_6844285_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 OPR/106.0.0.0"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844286_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.55 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36"

Anything else that would be useful to know?

No response

FlyveHest · 2024-01-24T09:52:30Z

Following up a bit, over the night the problem subsided for a few hours, possibly because clients where disconnected entirely (machine shut down), but it started again early this morning, with 2 clients showing the same symptoms.

Over the course of the morning, it has now become 6 clients that do this, with around 35 requests a second, in what looks like a GET and then OPTIONS cycle.

FlyveHest · 2024-01-24T11:45:59Z

Disabling presence on homeserver seems to have removed the problem (not that surprising)

I am wondering how it began, though, nothing have changed on the homeserver, which has been running for years, and all of a sudden a multitude of different clients start flooding presence state changes.

agrimpard · 2024-01-24T13:47:07Z

Hello, I've had exactly the same problem.
I realised it when disk space alerts appeared with apache/matrix/systemd logs taking up more than 2GB each per day and the processor struggling.

Disk space

Apache CPU

Apache total requests

After investigation, I found the same thing: clients spam the server with GET/OPTION sync requests and disabling presence solves the problem.

When I noticed the problem, the clients had Element 1.55 and the web clients 1.53. Synapse was at 1.93. After updating the clients to the latest version and synapse to 1.99, I still had the same problem. There were no configuration changes before/during the problem, apart from activating/deactivating presences.

I have around thirty active users and it's becoming a problem very, very quickly.

Installation Method: Debian package
Database: Postgres
Workers: Single process
Platform: Debian 11

rda0 · 2024-01-25T06:53:32Z

I have the same problem. Extreme CPU usage in the presence and generic-sync workers. It gradually started appearing on the 22.01.2024 at around 13:00 CET with version 1.98.0 (with no changes to the server since the restart into 1.98.0 on the 22.12.2023). During evening/night time the graphs looked quiet and as usual. The CPU usage went up again suddenly on the next day (23.) during office hours (and continued after an upgrade to 1.99.0). I also assume that it could be caused by certain clients or client versions.

rda0 · 2024-01-25T07:03:05Z

The symptoms look like matrix-org/synapse#16057, which did not reappear for months until now.

plui29989 · 2024-01-29T20:17:49Z

We have this problem as well.

Synapse Version: 1.99.0
Installation Method: Docker (matrixdotorg/synapse)
Database: Postgres
Workers: Single process
Platform: Ubuntu 22.04, in docker

gc-holzapfel · 2024-01-30T08:23:13Z

here as well, same config:

Synapse 1.99.0 (matrixdotorg/synapse)
Docker & Postgres on Ubuntu 22.04

rogue client is on osx, v1.11.56-rc.0

rda0 · 2024-01-30T08:48:16Z

@gc-holzapfel How did you identify the client? I am just interested, since I looked at clients versions and usernames in log snippets of good and bad minutes, but could not find a pattern. It always seemed as if all of them were involved in the ~10x request count during bad minutes.

gc-holzapfel · 2024-01-30T08:51:22Z

i had a look at the synapse/docker-logs and there is (so far) only one misbehaving client...although we have other users with same os&client that do not cause problems.

meise · 2024-02-05T12:01:09Z

I can see this behaviour also on one of our Synapse 1.99 instances. CPU and HTTP request increase 10 to 20 times more than usual. general _load_next_mult_id and general MultiWriterIdGenerator._update_table in DB transactions by txn rate increase in the same range:

OneBlue · 2024-02-06T21:05:50Z

I'm hitting the same issue with multiple clients on :
{"server_version":"1.87.0","python_version":"3.7.3"}

I confirm that disabling presence via:

presence:
  enabled: false

Solves the issue.

Ronchonchon45 · 2024-02-09T09:03:59Z

I had the same issue on the same date; January 23, 2023 around 10:30 a.m.
My server version was 1.98, he was updated to 1.100, but the issue persists.
We disabled presence to stop the issue, but we need to have this option enabled.

Synapse Version: 1.100
Installation Method: repo apt
Database: Postgres 15
Python: 3.11.2
Platform: debian 12

agrimpard · 2024-02-27T14:10:21Z

I have three separate environments, test with synapse monolith, preprod with synapse workers, prod with synapse workers and I still see the problem. It doesn't matter whether I update Synapse or Element, as soon as I reactivate the presences, we DDOS ourselves.

Would it be possible, even if it's a synapse bug, to limit the number of API GET/OPTION sync calls in Element to, I don't know, one every 10 seconds? At least not 15 per second ... is it really useful to make so many calls in such a short space of time?

Presence is a feature we'd really like to keep active!

jlemangarin · 2024-03-15T16:14:59Z

Still HARD flooding the browser and the server
1.98.0

raphaelbadawi · 2024-03-15T16:27:43Z

I'm hitting the same issue with multiple clients on : {"server_version":"1.87.0","python_version":"3.7.3"}

I confirm that disabling presence via:
presence:
  enabled: false
Solves the issue.

In Synapse 1.98 this configuration doesn't seem to work anymore. So flooding issue still present and not bypassable...

avanhovedev · 2024-03-15T17:01:59Z

Same problem
1.98.0

bjoe2k4 · 2024-03-15T17:39:34Z

On 1.101.0 (~ 150 MAU, no workers) and disabling presence is working for me. However it is a disaster for me having to disable a very basic and important feature of synapse with no fix in sight while new releases (not fixing the problem) keep flying in.

I'm a small fish, but I'm wondering what the guys with the big servers are doing to help with the issue.

jlemangarin · 2024-03-15T17:48:29Z

@clokep Do you have insight on this considering your work on matrix-org/synapse#16544 ?

clokep · 2024-03-15T22:45:31Z

@clokep Do you have insight on this considering your work on matrix-org/synapse#16544 ?

No. Maybe there's a misconfiguration? Multiple workers fighting or something. I haven't read in depth what's going on though -- I'm no longer working at Element.

jlemangarin · 2024-03-18T10:44:50Z

Actually this is really simple to reproduce, just did it in 2 min, you just have to :

Take 2 Web Element client (A and B) on the same account
Then a third one on another account (Called C)
Actively use A and C
Put B on another tab of another window you don't touch
Wait 5 min
Open the browser console of C (if your browser hadn't already crashed)

Whatever the issue is, it shows a lack of programming security at multiple layers :

Sync polling shouldn't flood like this (Min interval between 2 answers)
Presence events should be monitored by the server to prevent this (Like a user shouldn't be able to change it's presence every ms)
And for sure, multiple clients sending conflicting presence should be properly merged (which from what I have understood should have been done in attached issues)

We will try to dirty patch it for now and put the patch here if this is relevant.

FlyveHest · 2024-03-18T12:17:29Z

Actually this is really simple to reproduce, just did it in 2 min, you just have to :

Interesting, does this make it a client issue more than a server one, I wonder? I'm not really sure who triggers the syncing, if its the client misbehaving or if the server should actively try to stop this behaviour.

agrimpard · 2024-03-18T12:24:59Z

Interesting, does this make it a client issue more than a server one, I wonder? I'm not really sure who triggers the syncing, if its the client misbehaving or if the server should actively try to stop this behaviour.

For me it's both.
Element shouldn't make so many calls on its own.
Synapse should block if too many calls are made, otherwise any client can initiate a DDOS on its APIs.

FlyveHest · 2024-03-18T13:09:48Z

For me it's both.

That would make the most sense, yes, and in fact, it would make the most sense that the server isn't DDoSable through its official APIs.

jlemangarin · 2024-03-18T13:37:57Z

I think the client properly send it's own status, then just polls and wait for an answer as a classic polling would do (even if also as you mentionned, polling flood could actually be identified client-side and prevented/delayed)

But yeah, here the server have the main role, processing the conflicting presence requests and controling properly the polling system.

Aside of a polling flood control, a simple fix in the presence feature may be :

If the server already processed an BUSY presence in the previous POLL_INTERVAL then it should stay BUSY
If not - If the server already processed an ONLINE presence in the previous POLL_INTERVAL then it should stay ONLINE
....

So it would stay in the same defined order BUSY > ONLINE > ...

I think this is approximately the hotfix we are trying to apply rn, we'll keep you updated about that

cc. @raphaelbadawi

raphaelbadawi · 2024-03-19T17:56:28Z

Hello,

I've made a little patch (tested on 1.98 and current master branch). For users like us who may have force-enabled multi-tab sessions and find themselves with incohesive states within the same device, this solves the "blinking" between online and idle states which flooded the server and sometimes thread-blocked the client.

synapse_16843.txt

git apply --ignore-space-change --ignore-whitespace synapse_16843.txt

clokep · 2024-03-19T17:58:39Z

force-enabled multi-tab sessions

What does this mean?

raphaelbadawi · 2024-03-19T18:05:04Z

force-enabled multi-tab sessions

What does this mean?

We patched matrix-react-sdk so we can have multi-tabs session. This is why we had this flood: if a tab was awake and another was inactive, it kept syncing online->inactive->online->inactive into the same device id. The previous fix avoided flood among different devices, but not in this peculiar case.

clokep · 2024-03-19T18:06:38Z

What does "multi-tab sessions" mean? Does it mean you're logging into multiple different accounts? This sounds like it might be a client issue then if the data isn't properly segmented between the accounts or something like that.

jlemangarin · 2024-03-19T18:19:33Z

Confirmed it (finally 🥳 ) passed all QA tests with different presence states properly merged and not flooding the polling anymore !

=>

@clokep I think @raphaelbadawi says that he have multiple tabs (on the same account) sending different presence states, as if you have multiple clients, and then a third one on another browser that got flooded, this is the test case I use.

Thank you, finally passed the tests for production !

Next step would be to confirm how this could be introduced in the recently changed state merger mechanism

raphaelbadawi · 2024-03-19T18:22:36Z

the have multiple tabs (on the same account) sending different presence states, as if you have multiple clients

Yes this was my use case, same user id, same token, but on multiple tabs.

jlemangarin · 2024-03-21T10:15:58Z

Hello,

I've made a little patch (tested on 1.98 and current master branch). For users like us who may have force-enabled multi-tab sessions and find themselves with incohesive states within the same device, this solves the "blinking" between online and idle states which flooded the server and sometimes thread-blocked the client.

synapse_16843.txt

git apply --ignore-space-change --ignore-whitespace synapse_16843.txt

@agrimpard Could you try and give us a feedback ? Just got live in production today and seems to work !

cc. @Ronchonchon45 @OneBlue @gc-holzapfel @plui29989 @FlyveHest

Fix : element-hq#16843 Patch from : element-hq#16843 (comment)

agrimpard · 2024-03-21T14:26:10Z

I've just tested the patch.

I started by stopping Matrix, I applied the patch on /opt/venvs/matrix-synapse/lib/python3.11/site-packages/synapse/handlers/presence.py, I reactivated the presences and I restarted Matrix. I don't see any "presence storm".

I then stopped Matrix again, removed the patch on /opt/venvs/matrix-synapse/lib/python3.11/site-packages/synapse/handlers/presence.py, presences still enabled and restarted Matrix. I no longer see any "presence storm".

I should have started by reactivating the presences without the patch to see a real change. I wonder if the patch hasn't done some sort of reset on the presences and so, even without the patch, I'm out of the presence bug.
I need to do some more tests but it doesn't look bad at first glance :)

For those who want to test, you have the source presence.py patched above this message ...
Make sure you test the presences without the patch first !

Ronchonchon45 · 2024-03-27T14:24:15Z

I applied the patch on Monday (25th). Always was good, but this morning at 10AM, the problem is back.
But now, if I disable presence, the problem persist. I've disabled access_log for ngninx, and INFO log for the application to stopped the problem

raphaelbadawi · 2024-03-27T14:48:02Z

Hello @Ronchonchon45 . How do you reproduce the issue on your side ? For me it was when being logged in with the same user on several tabs at the same time.

Ronchonchon45 · 2024-03-28T09:21:46Z

I don't know how the problem came back.
No problem after patch in 2 days, and Wednesday at 3pm, I saw that the disk space was low, and that the log file size was big since 10am (1.5g in 5 hours).
How can I check if users is logged on several tabs at the same time?

raphaelbadawi · 2024-03-28T10:13:32Z

What do you have in the actual logs ? Is it flooding user presence state update (if a user state blinks rapidly between two states it may be related to multitab) or is it something else ?

ChristophWolffNabu · 2024-11-02T18:04:34Z

Still an issue on v1.118.0

ChristophWolffNabu · 2024-11-02T18:07:18Z

Fixed it temp. by closing al Browsertabs

rda0 · 2024-11-05T10:54:45Z

A workaround could be to rate-limit it on the reverse proxy. I did this similarly for #16987.

For example using nginx this could look something like (this probably also limits sync requests):

http {
    map $query_string $map_query_param_matrix_client_sync_set_presence {
        default "";
        ~(^|&)set_presence= $binary_remote_addr;
    }
    map $request_method $map_request_method_matrix_client_sync_set_presence {
        default "";
        GET $map_query_param_matrix_client_sync_set_presence;
    }
    limit_req_zone $map_request_method_matrix_client_sync_set_presence zone=matrix_client_sync_set_presence:10m rate=2r/s;

    server {
        location ~ ^/_matrix/client/v3/sync {
            limit_req zone=matrix_client_sync_set_presence burst=4 nodelay;
            proxy_pass http://$workers;
            proxy_read_timeout 600s;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_http_version 1.1;
            client_max_body_size 125M;
            access_log /var/log/nginx/matrix._matrix.log vhost_combined_tls_worker_withreq;
        }
    }
}

rda0 · 2024-12-04T13:36:41Z

This is still a problem with v1.120.2. It is not always a problem but often recently, usually on Mondays, this week every day but not all day long. So I still think this is client related but since we have basically no control over the clients it must be fixed on the server side.

How can we debug/fix this on the server side?
I think #16705 could be a viable solution.

Today I tracked down all Element Web/Desktop users/computers with versions older than Element/1\.11\.8[567] and rejected their IPs on the server. This alone did not stop the presence storm. But the CPU usage went down after a Synapse restart (which could be a coincidence). CPU usage stayed low even after unblocking the IPs.

bjoe2k4 · 2024-12-04T13:52:16Z

For me things luckily have calmed down since July and stayed pretty stable since then.

I've considered rate-limiting by IP on the reverse proxy, but this immediately becomes an issue as soon as multiple users are connected via NAT (VPN or WiFi).

So, the rate-limiting if implemented correctly has to happen server side and on a per authenticated user basis.

rda0 · 2024-12-04T19:35:39Z

So, the rate-limiting if implemented correctly has to happen server side and on a per authenticated user basis.

I agree.

I implemented an rc_set_presence.per_user ratelimit in rda0@edb93a0. It is live on my test server and appears to work (it did not explode and a quick presence test looks good). I will test it on my production server on the next occasion to see if it helps.

rda0 · 2024-12-05T19:26:18Z

I created a PR #18000 which I tested toady on our production server with success. A detailed report is in #18000 (comment). Would be great if somebody else could test it.

Patch:
patch_of_18000_for_v1.120.2.patch

FlyveHest mentioned this issue Jan 24, 2024

/_matrix/client/v3/sync never finishes #16845

Closed

realtyem mentioned this issue Feb 23, 2024

Element displays error that it failed to restore the session when the network is not yet connected when it launches element-hq/element-web#26967

Closed

This comment was marked as resolved.

Sign in to view

agrimpard added a commit to agrimpard/synapse that referenced this issue Mar 21, 2024

Fix presence storm

8b6ab2c

Fix : element-hq#16843 Patch from : element-hq#16843 (comment)

rda0 mentioned this issue Dec 5, 2024

Ratelimit set_presence updates #18000

Open

3 tasks

Presence change storm putting increased load on server #16843

Presence change storm putting increased load on server #16843

Comments

FlyveHest commented Jan 23, 2024

Description

Steps to reproduce

Homeserver

Synapse Version

Installation Method

Database

Workers

Platform

Configuration

Relevant log output

Anything else that would be useful to know?

FlyveHest commented Jan 24, 2024

FlyveHest commented Jan 24, 2024

agrimpard commented Jan 24, 2024 • edited Loading

rda0 commented Jan 25, 2024

rda0 commented Jan 25, 2024

plui29989 commented Jan 29, 2024

gc-holzapfel commented Jan 30, 2024 • edited Loading

rda0 commented Jan 30, 2024

gc-holzapfel commented Jan 30, 2024 • edited Loading

meise commented Feb 5, 2024

OneBlue commented Feb 6, 2024

Ronchonchon45 commented Feb 9, 2024

agrimpard commented Feb 27, 2024 • edited Loading

jlemangarin commented Mar 15, 2024

raphaelbadawi commented Mar 15, 2024

avanhovedev commented Mar 15, 2024

bjoe2k4 commented Mar 15, 2024

jlemangarin commented Mar 15, 2024 • edited Loading

clokep commented Mar 15, 2024

jlemangarin commented Mar 18, 2024 • edited Loading

FlyveHest commented Mar 18, 2024

agrimpard commented Mar 18, 2024

FlyveHest commented Mar 18, 2024

jlemangarin commented Mar 18, 2024

raphaelbadawi commented Mar 19, 2024

clokep commented Mar 19, 2024

raphaelbadawi commented Mar 19, 2024

clokep commented Mar 19, 2024

jlemangarin commented Mar 19, 2024

raphaelbadawi commented Mar 19, 2024

jlemangarin commented Mar 21, 2024

This comment was marked as resolved.

agrimpard commented Mar 21, 2024 • edited Loading

Ronchonchon45 commented Mar 27, 2024

raphaelbadawi commented Mar 27, 2024

Ronchonchon45 commented Mar 28, 2024

raphaelbadawi commented Mar 28, 2024

ChristophWolffNabu commented Nov 2, 2024

ChristophWolffNabu commented Nov 2, 2024

rda0 commented Nov 5, 2024 • edited Loading

rda0 commented Dec 4, 2024

bjoe2k4 commented Dec 4, 2024

rda0 commented Dec 4, 2024

rda0 commented Dec 5, 2024 • edited Loading

agrimpard commented Jan 24, 2024 •

edited

Loading

gc-holzapfel commented Jan 30, 2024 •

edited

Loading

gc-holzapfel commented Jan 30, 2024 •

edited

Loading

agrimpard commented Feb 27, 2024 •

edited

Loading

jlemangarin commented Mar 15, 2024 •

edited

Loading

jlemangarin commented Mar 18, 2024 •

edited

Loading

agrimpard commented Mar 21, 2024 •

edited

Loading

rda0 commented Nov 5, 2024 •

edited

Loading

rda0 commented Dec 5, 2024 •

edited

Loading