Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds ssh debugger for selenium test #498

Closed
wants to merge 12 commits into from

Conversation

robacarp
Copy link
Contributor

No description provided.

@krschacht
Copy link
Contributor

@robacarp I suspect you'll get what you need from this running in the PR, but if you need to merge into main for testing, go right ahead

@robacarp robacarp force-pushed the system-tests-rob branch 2 times, most recently from 3fe08f4 to 9e162a5 Compare August 23, 2024 20:22
@robacarp robacarp force-pushed the system-tests-rob branch 5 times, most recently from d66f151 to fb396f0 Compare August 23, 2024 21:21
@robacarp
Copy link
Contributor Author

@krschacht Alright, here's what I think is happening. I don't have a solution yet, but it's a race condition somewhere between chromedriver and selenium. In #483 you specc'd for a bigger runner because there was suspicion that it was a memory issue, but that probably made the problem worse because now ruby boots faster but chromedriver still takes forever? A short term reduction in pain might be to spec-down the runner.

The Net::ReadTimeout is coming from capybara (aka selenium) failing to hit chromedriver when attempting to set up the server. This stack trace is all sorts of interesting:

  • The tests run (and fail) before puma is started by capybara
  • The test hung because the server was still running and ruby wouldn't exit
  • It says the TCP socket was closed -- does this means the socket was open when it started but closed during the exchange? Or that it was never open? I suspect the former because the stack trace is in the middle of a read loop.
  • The failure is in the area of code which causes chromedriver to build a new session (ie, start chrome up):
   ...
    vendor/bundle/ruby/3.2.0/gems/actionpack-7.1.3.2/lib/action_dispatch/system_testing/driver.rb:55:in `block in register_selenium'
    <internal:kernel>:90:in `tap'
    vendor/bundle/ruby/3.2.0/gems/actionpack-7.1.3.2/lib/action_dispatch/system_testing/driver.rb:54:in `register_selenium'
    vendor/bundle/ruby/3.2.0/gems/actionpack-7.1.3.2/lib/action_dispatch/system_testing/driver.rb:41:in `block in register'
    vendor/bundle/ruby/3.2.0/gems/capybara-3.39.2/lib/capybara/session.rb:105:in `driver'
    vendor/bundle/ruby/3.2.0/gems/capybara-3.39.2/lib/capybara/session.rb:91:in `initialize'
   ...

I added wait-for-it.sh but the problem isn't that chromedriver isn't ready, because it still fails when waitforit passes.

This is also an interesting failure: https://github.com/AllYourBot/hostedgpt/pull/498/checks#step:9:37 It seems like everything booted up in the right order, but the tests just never ran (I cancelled the job after ~3min). It makes me wonder if there's some confusion in capybara connecting to chromedriver, and it's waiting for chromedriver? Is the PORT 9515 assumption in the chromedriver github action plugin wrong? (Or is it not correctly installed?)

@krschacht
Copy link
Contributor

@robacarp another big clue in all this is that it all used to work fine, for many months, and then this all started one day. Does that suggest anything else to you or a possible solution?

I even tried rolling back to the point in our commit history where everything was working reliably but it no longer does. I wonder if a dependency has a new version rolled out but if we could pin to an older version this might fix? But rolling back should have done that with gemfile.lock so maybe it’s a dependency that’s somehow outside of that, like part of GitHub action runner config?

@robacarp
Copy link
Contributor Author

I don't think this is anything to do with the code here, unfortunately. I think the fact that it ever worked was luck. I added a DEBUG environment variable to the run here and there's a lot more output from many of the tools in the process, but nothing jumps out.

Another thing I forgot to mention that suggests an inter-process race condition, when I SSH into the job mid-run, it sometimes fails or hangs for a bit. But if I interrupt the process (^c) and then re-run it, it goes fine.

@krschacht
Copy link
Contributor

Dang, it still seems really odd that for ~6 months these tests never hit this issue and the one day it starts hitting the issue regularly. That's what makes me think that an updated chromedriver or ubuntu-latest system dependency—something outside of gemfile.lock—changed and that introduced this issue. That's the only way I can explain rolling back the repo and still getting the error.

Also, I never get this issue locally on my dev machine. I only get it in github actions. Initially, I suspected this was because my dev machine has more resources (cores, etc) so that's what prompted me to upgrade the github actions server. And notably, that did help. When I was using the default ubuntu-latest I ran the test suite 10 times in a row and it failed every time. I don't think I ever got it to pass successfully. But when I upgraded to the larger instances then it's now passes about 50% of the time.

In terms of working around this, would it be possible to study the stack trace and find a place within the selenium or capybara code where you could re-open the class, re-write a method to add a simple "sleep 10" or something hacky like that in the right place within the boot cycle to work around the race?

@robacarp
Copy link
Contributor Author

https://github.com/AllYourBot/hostedgpt/pull/498/files#diff-58be79ee432aa1a1e12529b3141879d9c9503ac6e071c47e2d53462d339362b7R13-R21

This sort of works -- but if capybara doesn't start the server until after the tests finish, it still hangs.

@robacarp
Copy link
Contributor Author

Sorry, this -- I didn't realize I rolled back the retry code:

def self.test(name, &block)
  begin
    super(name, &block)
  rescue Net::ReadTimeout => e
    puts "(caught) Net::ReadTimeout: #{e}"
    retry_count ||= 0
    retry_count += 1
    retry if retry_count < 3
  end
end

@krschacht
Copy link
Contributor

Is there some place you think you could introduce a delay to guarantee capybara gets the server started?

@robacarp
Copy link
Contributor Author

Not that I've seen. I don't think it's a delay in chromedriver -- I added a wait to ensure that gets started, but it's still failing. It's so strange that it's attempting to run the tests before capybara has started the server.

@matthewbennink
Copy link
Contributor

Might be worth looking at Cuprite to see if it may get around the issue. See this article and the cuprite project.

@krschacht
Copy link
Contributor

Hi @robacarp, Matthew's suggestion is an interesting one but I also got another lead. I opened an issue on Selenium and someone explained how they resolved this by pinning a chromium version:
teamcapybara/capybara#2770 (comment)

@krschacht
Copy link
Contributor

Just to keep capture this idea, on that other Issue someone claims that he solved a bunch of issues by switching to Playwright and linked to an article: teamcapybara/capybara#2770 (comment)

@krschacht krschacht closed this Oct 17, 2024
@robacarp robacarp deleted the system-tests-rob branch October 17, 2024 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants