
Investigate and document timeouts across components #525

Closed
eloquence opened this issue Apr 2, 2020 · 10 comments

@eloquence
Member

The client performs many network operations with associated timeouts. If timeouts are too short, operations may fail; if they are too long, user feedback may be delayed. Timeouts are negotiated at different levels of the stack, e.g.:

  • client connection to proxy
  • proxy connection to API
  • Apache<->wsgi connection

When shorter timeouts at one level override longer ones at another, this can lead to unexpected results, as we saw during investigation of freedomofpress/securedrop-client#1007. There are also different types of timeouts (e.g., ConnectTimeout vs. ReadTimeout) which may need to be set to different values.
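
To make the ConnectTimeout/ReadTimeout distinction concrete, here is a minimal sketch using Python's requests library (where those exception names come from); the URL and timeout values are placeholders, not settings from any SecureDrop component:

```python
# Minimal illustration of connect vs. read timeouts with the requests library.
# The URL and timeout values are placeholders, not values used by any
# SecureDrop component.
import requests

try:
    resp = requests.get(
        "https://example.com/api/v1/sources",
        # (connect timeout, read timeout): fail if the TCP connection is not
        # established within 5s, or if the server goes 60s without sending data.
        timeout=(5, 60),
    )
except requests.exceptions.ConnectTimeout:
    print("could not establish a connection in time")
except requests.exceptions.ReadTimeout:
    print("connected, but the server stopped sending data for too long")
```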

We should better document the different timeouts, and the overall connection architecture, so that all developers can consistently reason about the expected behavior of the whole application. This can be done in the workstation wiki for now.

eloquence added the docs label on Apr 2, 2020
@sssoleileraaa
Contributor

sssoleileraaa commented Apr 7, 2020

Adding notes as I go...

@sssoleileraaa
Contributor

As of now, my understanding is that we have two main issues; the rest is just cleanup/refactoring:

  1. we are timing out when there are more than a hundred sources or so (I need to get back into testing today to determine how frequent these timeouts become - I'm pretty sure it still times out every single time at 1000 sources)
  2. [needs investigation] we might be taking too long before we tell the user that there was an error with the login session when trying to authenticate

@rmol is speeding up the wsgi process by using a key cache (getting source keys takes a long time), and word is he will be adding nginx (see freedomofpress/securedrop#5184), but we will still need to confirm whether this fixes the frequent timeout issue.

@rmol
Contributor

rmol commented Apr 7, 2020

if the proxy is able to make a connection, or if the server sends data, in less than 120 seconds then the proxy will continue to wait for however long it takes to get the full response

To be annoyingly pedantic, once the connection is established, if the server were ever to go 120 seconds without sending anything, the proxy would time out the request. For our purposes, it's almost always the same thing, as we're generally shipping entire responses and not trickling data. And right now we'll hit the 60-second Apache TimeOut first.
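
A self-contained sketch of that behavior, assuming the proxy's HTTP layer behaves like Python's requests/urllib3 (the toy server and values are illustrative only): the read timeout bounds the gap between received chunks, not the total response time, so a response that trickles in slowly enough never trips it.

```python
# A toy server that trickles one byte every 2 seconds, ~10 seconds total.
# With a 5-second read timeout the request still succeeds, because no single
# gap between chunks exceeds 5 seconds. (Illustrative only; not proxy code.)
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class TrickleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "5")
        self.end_headers()
        for _ in range(5):
            time.sleep(2)            # 2s gap between chunks, under the 5s read timeout
            self.wfile.write(b"x")
            self.wfile.flush()

    def log_message(self, *args):    # keep the demo output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), TrickleHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Total transfer takes ~10s, yet a (connect, read) timeout of (5, 5) succeeds,
# because the read timeout clock restarts every time data arrives.
resp = requests.get(f"http://127.0.0.1:{server.server_port}/", timeout=(5, 5))
print(resp.text)  # "xxxxx"
server.shutdown()
```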

I'm pretty sure it still times out every single time at 1000 sources

I'm not seeing this with my staging environment. With just the fingerprint caching, responses take less than 30 seconds. With the key caching added in freedomofpress/securedrop#5184 they're under 20 seconds. Could be differences in our hardware or VM performance. (And yes, this is still pretty ridiculous for the amount of work happening and the size of the responses. A bunch of this is Tor, which can be helped by compressing the responses, but we're still taking several seconds to produce a relatively small amount of JSON, and I think it could still use more scrutiny.)

Just for the benefit of those following along at home, I don't think we're planning on nginx any time soon? Certainly not as part of freedomofpress/securedrop#5184. There is an issue for that (freedomofpress/securedrop#2414) but I haven't heard anything about it recently.

@sssoleileraaa
Contributor

To be annoyingly pedantic, once the connection is established, if the server were ever to go 120 seconds without sending anything, the proxy would time out the request. For our purposes, it's almost always the same thing, as we're generally shipping entire responses and not trickling data. And right now we'll hit the 60-second Apache TimeOut first.

Totally, that's my understanding too, which is why I mentioned that this should probably be updated to 60 seconds to match the Apache config. However, my statement about the proxy waiting 120 seconds is true for the way it's configured as of now. If the Apache config timeout were updated to, say, 180 seconds, the proxy would still time out after 120 seconds, since that is what is specified.

I'm not seeing this with my staging environment.

I will test again. It could be a hardware issue, but it sounds like the 1000-source timeout issue has been fixed for you ever since you updated the proxy timeout to 120 seconds, which allowed us to wait up to 60 seconds for a connection? Although, it seems like the 40-second SDK timeout at the subprocess layer would time out before we would hit that 60-second timeout, correct? Something is not lining up here. I'm going to run some new tests and will report back.
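
A minimal sketch of the layering question above, assuming the SDK-style pattern of wrapping the proxy invocation in a subprocess with its own timeout (the 40-second value and the sleep stand-in are illustrative only): whichever timeout in the stack is shortest is the one that fires.

```python
# Sketch of why the shortest timeout in the stack wins. If the caller wraps
# the proxy invocation in a subprocess with a 40-second timeout, a 60-second
# Apache TimeOut or a 120-second proxy timeout can never be reached.
# "sleep 90" stands in for a request that takes longer than any of them.
import subprocess

try:
    subprocess.run(["sleep", "90"], timeout=40, check=True)
except subprocess.TimeoutExpired:
    print("caller gave up after 40s; the 60s and 120s limits further down never fired")
```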

@sssoleileraaa
Contributor


I just tested the latest client on Qubes against my staging server with 1000 sources and still see no sources populated in the source list after 10 minutes due to RequestTimeoutErrors. For more info about what I saw and which version of the client I ran, see:

freedomofpress/securedrop-client#1025 (comment)

I'm concerned that we are still seeing timeouts as much as one third of the time when there are 200 sources (see freedomofpress/securedrop-client#1007 (comment)), so I will test a server with 200 sources next.

@redshiftzero
Contributor

I don't think we're planning on nginx any time soon?

Nope, this is not in the near or medium term - to expand a bit, we don't have a good way of performing the migration that doesn't add a lot of burden for administrators. The bionic upgrade might be the next window to do it.

Although, it seems like the 40 second sdk timeout at the subprocess layer would timeout before we would hit that 60 second timeout, correct?

👍

Backing up for a second: before we did the final release for the first pilot provisioning, I tested master of securedrop-client at that time with 500 sources using freedomofpress/securedrop-proxy#70, as a check to ensure that I did not see timeouts. I should note that I did not delete my existing database before doing this, which in hindsight I should have done. This was on a version of the SecureDrop server that did have the fingerprint cache.

I just added another 500 sources (1000 total) on the server, deleted my client database, and ran with master (56e0981936838c8e24c55a6fdb3b66ced227b22b) of the client. Here are the timings logged from the client side:

17:29:19,996 - Login success
17:30:40,138 - Sync failure
17:32:15,163 - Sync failure
17:33:50,161 - Sync failure
17:35:43,642 - First log line for get_remote_data - Fetched 1000 remote sources, 1998 remote submissions, 2046 remote replies. update_sources took 0.2768s, update_files took 0.0186s, update_messages took 1.5341s, update_replies took 4.9750s.

Sync does continue to show failures, but they are intermittent.

Otherwise it sounds like everyone is using latest master in the client directory, the latest securedrop-proxy with the 120s timeout, and a version of the SecureDrop server that has the fingerprint cache. One possible source of divergence is that someone has a particularly slow Tor circuit (which would be a useful observation, since this can also happen in production).

I think the next step is to compare directly the response times of our get_sources endpoint through Tor as a sanity check. While the overall goal here is to determine, for normal conditions, sufficiently long timeout values for sync operations such that errors won't be needlessly raised to the user, for freedomofpress/securedrop-client#1025 we're just trying to set the timeout value for these metadata sync API calls correctly. Since there's some divergence, we could measure the response time of the direct API call to make sure that something else isn't going on - i.e., testing with cron and removing the client entirely from the equation. The distribution of that response time (for a representative sample of Tor circuits) tells us what the timeout value should be set to on our side.
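
A rough sketch of what such a direct measurement could look like; this is not existing tooling, and the onion address, token header format, SOCKS port, and endpoint path are assumptions to be filled in (it also assumes requests[socks] is installed):

```python
# Rough timing harness for the direct API call through Tor, bypassing the
# client and proxy. Placeholders/assumptions: the .onion address, the token
# header format, the /api/v1/sources path, and a local Tor SOCKS port of 9050.
import statistics
import time

import requests

ONION_URL = "http://REPLACE_ME.onion"      # journalist interface address (placeholder)
TOKEN = "REPLACE_WITH_API_TOKEN"           # placeholder auth token
PROXIES = {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"}

timings = []
for run in range(10):                      # one sample per run; restart Tor to vary circuits
    start = time.monotonic()
    resp = requests.get(
        f"{ONION_URL}/api/v1/sources",
        headers={"Authorization": f"Token {TOKEN}"},
        proxies=PROXIES,
        timeout=(60, 120),                 # generous limits so we measure rather than fail
    )
    elapsed = time.monotonic() - start
    timings.append(elapsed)
    print(f"run {run}: HTTP {resp.status_code} in {elapsed:.1f}s")

print(f"median {statistics.median(timings):.1f}s, max {max(timings):.1f}s")
```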

@eloquence
Member Author

For the 4/22-5/6 sprint, @creviera has agreed to clean up these notes a bit and copy them to the wiki, with review/input from @rmol. Then we can close out this issue.

@sssoleileraaa
Contributor

Wiki updates can be found here: https://github.com/freedomofpress/securedrop-workstation/wiki/Timeouts

@eloquence
Member Author

Thank you for the write-up, Allie! :) @rmol will take a spin through it sometime during this sprint (5/6-5/20), then we can close out this issue.

@rmol
Contributor

rmol commented May 12, 2020

Wiki summary looks good. Closing.

rmol closed this as completed May 12, 2020