
Presence change storm putting increased load on server #16843

Open

FlyveHest opened this issue Jan 23, 2024 · 44 comments

@FlyveHest

Description

Six hours ago, my private Synapse server suddenly started putting five times the usual load on the machine.

I have narrowed it down to what seems to be a handful of clients, based on IP, that send GET and OPTIONS requests for around 150 presence changes every second.

Users are running either the latest version of Element Web, or official clients on phones or desktop.

Searching the issues, this seems somewhat related to #16705.

Steps to reproduce

I don't know; the flood started six hours ago with no change to either the server or the clients (from what I can tell) and has been running since.

Restarting the server did nothing; the presence changes continued as soon as it came back up.

I also tried waiting a minute before starting it again: same issue.

Homeserver

matrix.gladblad.dk

Synapse Version

1.99.0

Installation Method

Docker (matrixdotorg/synapse)

Database

SQLite

Workers

Single process

Platform

Ubuntu 20, in docker

Configuration

No response

Relevant log output

x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=2&timeout=30000&set_presence=unavailable&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 916 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 OPR/106.0.0.0"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 916 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844285_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "GET /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=online&since=s322387_6844284_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 200 917 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.55 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=2&timeout=30000&set_presence=unavailable&since=s322387_6844285_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 OPR/106.0.0.0"
x.x.x.x - - [23/Jan/2024:17:53:32 +0100] "OPTIONS /_matrix/client/v3/sync?filter=3&timeout=30000&set_presence=unavailable&since=s322387_6844286_4_361249_403590_75_6308_2754_0_1 HTTP/1.1" 204 406 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.55 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36"

Anything else that would be useful to know?

No response

@FlyveHest
Author

Following up a bit: over the night the problem subsided for a few hours, possibly because the clients were disconnected entirely (machines shut down), but it started again early this morning, with 2 clients showing the same symptoms.

Over the course of the morning it has grown to 6 clients doing this, with around 35 requests a second, in what looks like a GET and then OPTIONS cycle.

@FlyveHest
Author

Disabling presence on the homeserver seems to have removed the problem (not that surprising).

I am wondering how it began, though; nothing has changed on the homeserver, which has been running for years, and all of a sudden a multitude of different clients started flooding presence state changes.

@agrimpard
Contributor

agrimpard commented Jan 24, 2024

Hello, I've had exactly the same problem.
I realised it when disk space alerts appeared, with the apache/matrix/systemd logs taking up more than 2 GB each per day and the processor struggling.

Disk space (screenshot)

Apache CPU (screenshot)

Apache total requests (screenshot)

After investigation, I found the same thing: clients spam the server with GET/OPTIONS sync requests, and disabling presence solves the problem.

When I noticed the problem, the clients were on Element 1.55 and the web clients on 1.53; Synapse was at 1.93. After updating the clients to the latest version and Synapse to 1.99, I still had the same problem. There were no configuration changes before/during the problem, apart from enabling/disabling presence.

I have around thirty active users and it's becoming a problem very, very quickly.

Installation Method: Debian package
Database: Postgres
Workers: Single process
Platform: Debian 11

@rda0

rda0 commented Jan 25, 2024

I have the same problem: extreme CPU usage in the presence and generic-sync workers. It gradually started appearing on 22.01.2024 at around 13:00 CET with version 1.98.0 (with no changes to the server since the restart into 1.98.0 on 22.12.2023). During the evening/night the graphs looked quiet and normal. The CPU usage suddenly went up again the next day (the 23rd) during office hours (and continued after an upgrade to 1.99.0). I also assume that it could be caused by certain clients or client versions.

(screenshot: synapse presence graphs, 2024-01-25)

@rda0

rda0 commented Jan 25, 2024

The symptoms look like matrix-org/synapse#16057, which did not reappear for months until now.

@plui29989

We have this problem as well.

Synapse Version: 1.99.0
Installation Method: Docker (matrixdotorg/synapse)
Database: Postgres
Workers: Single process
Platform: Ubuntu 22.04, in docker

@gc-holzapfel

gc-holzapfel commented Jan 30, 2024

Here as well, same config:

Synapse 1.99.0 (matrixdotorg/synapse)
Docker & Postgres on Ubuntu 22.04

The rogue client is on macOS, v1.11.56-rc.0.

@rda0

rda0 commented Jan 30, 2024

@gc-holzapfel How did you identify the client? I am just interested, since I looked at client versions and usernames in log snippets from good and bad minutes but could not find a pattern. It always seemed as if all of them were involved in the ~10x request count during bad minutes.

@gc-holzapfel

gc-holzapfel commented Jan 30, 2024

I had a look at the synapse/docker logs and there is (so far) only one misbehaving client, although we have other users with the same OS and client that do not cause problems.

@meise
Contributor

meise commented Feb 5, 2024

I can also see this behaviour on one of our Synapse 1.99 instances. CPU usage and HTTP requests increase to 10 to 20 times the usual level. The "general _load_next_mult_id" and "general MultiWriterIdGenerator._update_table" rates in "DB transactions by txn rate" increase in the same range:
(screenshots attached)

@OneBlue

OneBlue commented Feb 6, 2024

I'm hitting the same issue with multiple clients on:
{"server_version":"1.87.0","python_version":"3.7.3"}

I confirm that disabling presence via:

presence:
  enabled: false

solves the issue.

@Ronchonchon45

I had the same issue starting on the same date, January 23, 2024, around 10:30 a.m.
My server was on version 1.98 and was updated to 1.100, but the issue persists.
We disabled presence to stop the issue, but we need to have this option enabled.

Synapse Version: 1.100
Installation Method: repo apt
Database: Postgres 15
Python: 3.11.2
Platform: debian 12

@agrimpard
Contributor

agrimpard commented Feb 27, 2024

I have three separate environments (test with a Synapse monolith, preprod with Synapse workers, prod with Synapse workers) and I still see the problem. It doesn't matter whether I update Synapse or Element: as soon as I reactivate presence, we DDoS ourselves.

Would it be possible, even if it's a Synapse bug, to limit the number of GET/OPTIONS sync API calls in Element to, I don't know, one every 10 seconds? At least not 15 per second... is it really useful to make so many calls in such a short space of time?

Presence is a feature we'd really like to keep active!

@jlemangarin

Still flooding the browser and the server hard.
1.98.0

@raphaelbadawi

I'm hitting the same issue with multiple clients on : {"server_version":"1.87.0","python_version":"3.7.3"}

I confirm that disabling presence via:

presence:
  enabled: false

Solves the issue.

In Synapse 1.98 this configuration doesn't seem to work anymore, so the flooding issue is still present and cannot be bypassed...

@avanhovedev

Same problem
1.98.0

@bjoe2k4

bjoe2k4 commented Mar 15, 2024

I'm on 1.101.0 (~150 MAU, no workers) and disabling presence is working for me. However, it is a disaster having to disable a very basic and important feature of Synapse with no fix in sight while new releases (which don't fix the problem) keep flying in.

I'm a small fish, but I'm wondering what the guys with the big servers are doing to help with the issue.

@jlemangarin

jlemangarin commented Mar 15, 2024

@clokep Do you have insight on this considering your work on matrix-org/synapse#16544 ?

@clokep
Contributor

clokep commented Mar 15, 2024

@clokep Do you have insight on this considering your work on matrix-org/synapse#16544 ?

No. Maybe there's a misconfiguration? Multiple workers fighting or something. I haven't read in depth what's going on though -- I'm no longer working at Element.

@jlemangarin

jlemangarin commented Mar 18, 2024

Actually this is really simple to reproduce; I just did it in 2 minutes. You just have to:

  • Open two Element Web clients (A and B) on the same account
  • Open a third one on another account (called C)
  • Actively use A and C
  • Put B in a tab of another window that you don't touch
  • Wait 5 minutes
  • Open the browser console of C (if your browser hasn't already crashed)

(screenshot attached)

Whatever the issue is, it shows a lack of defensive programming at multiple layers:

  • Sync polling shouldn't flood like this (there should be a minimum interval between two responses)
  • Presence events should be throttled by the server to prevent this (a user shouldn't be able to change their presence every millisecond)
  • And of course, multiple clients sending conflicting presence should be properly merged (which, from what I have understood, should already have been done in the attached issues)

We will try to dirty-patch it for now and put the patch here if it is relevant.
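In the meantime, here is a minimal sketch of the kind of server-side throttle meant by the first two points (a hypothetical helper with an assumed interval, not Synapse's actual code): ignore set_presence updates that arrive faster than a minimum interval per user/device.

# Hypothetical sketch, not Synapse's actual code: ignore set_presence updates
# that arrive more often than a minimum interval per (user, device) pair.

import time

MIN_PRESENCE_INTERVAL = 10.0  # seconds; assumed value for illustration

_last_presence_update: dict[tuple[str, str], float] = {}

def should_process_presence(user_id: str, device_id: str) -> bool:
    """Return True if this set_presence update should be processed,
    False if it arrived too soon after the previous accepted one."""
    now = time.monotonic()
    key = (user_id, device_id)
    last = _last_presence_update.get(key)
    if last is not None and now - last < MIN_PRESENCE_INTERVAL:
        return False  # too soon: drop the update and keep the stored state
    _last_presence_update[key] = now
    return True

Dropping an over-frequent update would simply keep the previously stored state, so legitimate presence changes would still show up within the chosen interval.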

@FlyveHest
Author

Actually this is really simple to reproduce; I just did it in 2 minutes. You just have to:

Interesting; does this make it a client issue more than a server one, I wonder? I'm not really sure who drives the syncing, whether it's the client misbehaving or whether the server should actively try to stop this behaviour.

@agrimpard
Contributor

Interesting; does this make it a client issue more than a server one, I wonder? I'm not really sure who drives the syncing, whether it's the client misbehaving or whether the server should actively try to stop this behaviour.

For me it's both.
Element shouldn't make so many calls on its own.
Synapse should block if too many calls are made; otherwise any client can launch a DDoS against its APIs.

@FlyveHest
Author

For me it's both.

That would make the most sense, yes; at the very least, the server shouldn't be DDoSable through its official APIs.

@jlemangarin

I think the client properly sends its own status, then just polls and waits for an answer, as classic polling would (even if, as you mentioned, a polling flood could also be identified client-side and prevented/delayed).

But yeah, here the server has the main role: processing the conflicting presence requests and properly controlling the polling system.

Aside from polling flood control, a simple fix in the presence feature could be:

  • If the server has already processed a BUSY presence in the previous POLL_INTERVAL, the user should stay BUSY
  • If not, and the server has already processed an ONLINE presence in the previous POLL_INTERVAL, the user should stay ONLINE
  • ....

So it would stay in the same defined order: BUSY > ONLINE > ...
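As a rough illustration only (assumed state names and priorities, not the actual hotfix), that ordering could be expressed as:

# Rough illustration only (assumed state names and priorities, not the actual
# patch): when several devices/tabs report different presence states within
# one poll interval, keep the highest-priority one instead of flip-flopping.

PRESENCE_PRIORITY = {"busy": 3, "online": 2, "unavailable": 1, "offline": 0}

def merge_presence(states: list[str]) -> str:
    """Pick the winning presence state from the states reported by all of a
    user's devices during one poll interval."""
    if not states:
        return "offline"
    return max(states, key=lambda s: PRESENCE_PRIORITY.get(s, 0))

# Example: one tab reports "unavailable" while another reports "online"
assert merge_presence(["unavailable", "online", "unavailable"]) == "online"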

(screenshot attached)

I think this is approximately the hotfix we are trying to apply right now; we'll keep you updated on that.

cc. @raphaelbadawi

@raphaelbadawi

Hello,

I've made a little patch (tested on 1.98 and the current master branch). For users like us who may have force-enabled multi-tab sessions and end up with inconsistent states within the same device, it solves the "blinking" between the online and idle states which flooded the server and sometimes thread-blocked the client.

synapse_16843.txt

git apply --ignore-space-change --ignore-whitespace synapse_16843.txt

@clokep
Contributor

clokep commented Mar 19, 2024

force-enabled multi-tab sessions

What does this mean?

@raphaelbadawi

force-enabled multi-tab sessions

What does this mean?

We patched matrix-react-sdk so that we can have multi-tab sessions. This is why we had this flood: if one tab was awake and another was inactive, the session kept syncing online->inactive->online->inactive for the same device ID. The previous fix avoided the flood across different devices, but not in this particular case.

@clokep
Contributor

clokep commented Mar 19, 2024

What does "multi-tab sessions" mean? Does it mean you're logging into multiple different accounts? This sounds like it might be a client issue then if the data isn't properly segmented between the accounts or something like that.

@jlemangarin

Confirmed: it (finally 🥳) passed all QA tests, with the different presence states properly merged and no more flooding of the polling!

(screenshots attached)

@clokep I think @raphaelbadawi means that he has multiple tabs (on the same account) sending different presence states, as if there were multiple clients, plus a third client on another browser that got flooded; that is the test case I use.

Thank you, we finally passed the tests for production!

The next step would be to confirm how this could be integrated into the recently changed state-merging mechanism.

@raphaelbadawi

he has multiple tabs (on the same account) sending different presence states, as if there were multiple clients

Yes, this was my use case: same user ID, same token, but on multiple tabs.

@jlemangarin

Hello,

I've made a little patch (tested on 1.98 and the current master branch). For users like us who may have force-enabled multi-tab sessions and end up with inconsistent states within the same device, it solves the "blinking" between the online and idle states which flooded the server and sometimes thread-blocked the client.

synapse_16843.txt

git apply --ignore-space-change --ignore-whitespace synapse_16843.txt

@agrimpard Could you try it and give us some feedback? It just went live in production today and seems to work!

cc. @Ronchonchon45 @OneBlue @gc-holzapfel @plui29989 @FlyveHest


agrimpard added a commit to agrimpard/synapse that referenced this issue Mar 21, 2024
@agrimpard
Contributor

agrimpard commented Mar 21, 2024

I've just tested the patch.

I started by stopping Matrix, applied the patch to /opt/venvs/matrix-synapse/lib/python3.11/site-packages/synapse/handlers/presence.py, re-enabled presence and restarted Matrix. I don't see any "presence storm".

I then stopped Matrix again, removed the patch from /opt/venvs/matrix-synapse/lib/python3.11/site-packages/synapse/handlers/presence.py, left presence enabled and restarted Matrix. I still don't see any "presence storm".

I should have started by re-enabling presence without the patch to see a real change. I wonder whether the patch did some sort of reset on the presence data, so that even without the patch I'm out of the presence bug.
I need to do some more tests, but it doesn't look bad at first glance :)

For those who want to test, the patched presence.py source is above this message...
Make sure you test presence without the patch first!

@Ronchonchon45

I applied the patch on Monday (the 25th). Everything was fine, but this morning at 10 AM the problem came back.
But now, even if I disable presence, the problem persists. I've disabled access_log for nginx and INFO logging for the application to stop the problem.

@raphaelbadawi

Hello @Ronchonchon45. How do you reproduce the issue on your side? For me it happened when I was logged in with the same user in several tabs at the same time.

@Ronchonchon45

I don't know how the problem came back.
There was no problem for two days after the patch, and then on Wednesday at 3 PM I saw that disk space was low and that the log file had been growing since 10 AM (1.5 GB in 5 hours).
How can I check whether users are logged in on several tabs at the same time?

@raphaelbadawi

What do you have in the actual logs? Is it flooding user presence state updates (if a user's state blinks rapidly between two states, it may be related to multi-tab) or is it something else?

@ChristophWolffNabu

Still an issue on v1.118.0

@ChristophWolffNabu

Fixed it temporarily by closing all browser tabs.

@rda0

rda0 commented Nov 5, 2024

A workaround could be to rate-limit it on the reverse proxy. I did this similarly for #16987.

For example, using nginx this could look something like the following (note that this probably also limits regular sync requests):

http {
    map $query_string $map_query_param_matrix_client_sync_set_presence {
        default "";
        ~(^|&)set_presence= $binary_remote_addr;
    }
    map $request_method $map_request_method_matrix_client_sync_set_presence {
        default "";
        GET $map_query_param_matrix_client_sync_set_presence;
    }
    limit_req_zone $map_request_method_matrix_client_sync_set_presence zone=matrix_client_sync_set_presence:10m rate=2r/s;

    server {
        location ~ ^/_matrix/client/v3/sync {
            limit_req zone=matrix_client_sync_set_presence burst=4 nodelay;
            proxy_pass http://$workers;
            proxy_read_timeout 600s;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_http_version 1.1;
            client_max_body_size 125M;
            access_log /var/log/nginx/matrix._matrix.log vhost_combined_tls_worker_withreq;
        }
    }
}

@rda0

rda0 commented Dec 4, 2024

This is still a problem with v1.120.2. It is not a problem all the time, but it has been frequent recently: usually on Mondays, and this week every day, though not all day long. So I still think this is client-related, but since we have basically no control over the clients, it must be fixed on the server side.

How can we debug/fix this on the server side?
I think #16705 could be a viable solution.

Today I tracked down all Element Web/Desktop users/computers with versions older than Element/1\.11\.8[567] and rejected their IPs on the server. This alone did not stop the presence storm. But the CPU usage went down after a Synapse restart (which could be a coincidence). CPU usage stayed low even after unblocking the IPs.

(screenshot attached)

@bjoe2k4

bjoe2k4 commented Dec 4, 2024

For me things luckily have calmed down since July and stayed pretty stable since then.

(screenshot attached)

I've considered rate-limiting by IP on the reverse proxy, but this immediately becomes an issue as soon as multiple users are connected via NAT (VPN or WiFi).

So, the rate-limiting, if implemented correctly, has to happen server-side and on a per-authenticated-user basis.
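For illustration only, a per-user limiter of that kind might look roughly like the following sketch (assumed rate, burst size and names, not the actual Synapse ratelimiter), keyed on the authenticated user ID rather than the client IP so that users behind the same NAT don't throttle each other:

# Sketch only (assumed names and numbers, not the actual Synapse ratelimiter):
# a simple token bucket per authenticated user ID, so users sharing a NATed IP
# do not throttle each other.

import time
from collections import defaultdict

RATE_PER_SECOND = 0.1  # on average one presence update per 10 seconds
BURST_COUNT = 5        # allow short bursts before limiting kicks in

class PerUserPresenceLimiter:
    def __init__(self) -> None:
        self._tokens: dict[str, float] = defaultdict(lambda: float(BURST_COUNT))
        self._last_seen: dict[str, float] = {}

    def allow(self, user_id: str) -> bool:
        """Return True if this user's presence update may be processed now."""
        now = time.monotonic()
        last = self._last_seen.get(user_id, now)
        self._last_seen[user_id] = now
        # Refill tokens for the elapsed time, capped at the burst size.
        tokens = min(BURST_COUNT, self._tokens[user_id] + (now - last) * RATE_PER_SECOND)
        if tokens < 1.0:
            self._tokens[user_id] = tokens
            return False  # rate limited: ignore this set_presence value
        self._tokens[user_id] = tokens - 1.0
        return True

A request that fails allow() could simply have its set_presence parameter ignored rather than being rejected outright.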

@rda0

rda0 commented Dec 4, 2024

So, the rate-limiting, if implemented correctly, has to happen server-side and on a per-authenticated-user basis.

I agree.

I implemented an rc_set_presence.per_user ratelimit in rda0@edb93a0. It is live on my test server and appears to work (it did not explode and a quick presence test looks good). I will test it on my production server on the next occasion to see if it helps.

@rda0

rda0 commented Dec 5, 2024

I created PR #18000, which I tested today on our production server with success. A detailed report is in #18000 (comment). It would be great if somebody else could test it.

Patch:
patch_of_18000_for_v1.120.2.patch
