
CI failing with mounting volume error #7659

Closed
oxarbitrage opened this issue Oct 2, 2023 · 15 comments · Fixed by #7662, #7665, #7686 or #7690

oxarbitrage (Contributor) commented Oct 2, 2023

https://github.com/ZcashFoundation/zebra/actions/runs/6382508836/job/17321768540?pr=7653#step:13:188

docker: Error response from daemon: error while mounting volume '/var/lib/docker/volumes/fully-synced-rpc-85b855b/_data': failed to mount local volume: mount /dev/sdb:/var/lib/docker/volumes/fully-synced-rpc-85b855b/_data: device or resource busy.
Error: Process completed with exit code 125.

This seems to be happening for every PR we have opened recently. We can post more links to failures and more information in this ticket.
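For anyone reproducing this, the backing device of the failing volume can be checked with the standard docker CLI (the volume name is taken from the error above; what appears in the Options output is an assumption about how the volume was created):

```bash
# Inspect the named volume from the error above to see which device
# (if any) backs it via the local driver; the Options field would show
# settings like "device": "/dev/sdb", "type": "ext4" (assumed values).
docker volume inspect fully-synced-rpc-85b855b
```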

oxarbitrage added the A-devops (Area: Pipelines, CI/CD and Dockerfiles) and P-Critical 🚑 labels Oct 2, 2023
oxarbitrage self-assigned this Oct 2, 2023
oxarbitrage (Contributor, Author) commented

This is not the first time we have seen this error in CI history. One option is to find out what we did that time and either repeat it or draw information from it.

oxarbitrage (Contributor, Author) commented Oct 3, 2023

On closer inspection, I've observed that the error is not present in every open pull request, but only in the one linked here.

The error is visible in the Zebra Tip JSON-RPC job, while none of the other open pull requests encounter this issue in the same job or with the specific error message.

All other pull requests pass this job and start failing in the lightwalletd tip update job with a 'The operation was canceled.' error.

So, we should pay attention only to what is happening in this specific pull request in this ticket:

  • It is using this image: zebrad-cache-7633-merge-7b222f7-v25-mainnet-tip-u-174603. You can find this information in the Find fully-synced-rpc cached state disk job.
  • The job is failing in Zebra Tip JSON-RPC, but it failed only once in this particular pull request and nowhere else.

I suggest deleting the possibly corrupted image from gcloud and restarting the CI for this pull request. Another image will be selected; if this was a one-time issue, the CI should then pass the Zebra Tip JSON-RPC job.

This will help us determine if the problem was related to the image, and it may also lower the priority of the ticket.

@gustavovalverde, please let me know your thoughts on this. Any additional input is welcome.
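For reference, the deletion step might look like this (a sketch only: whether the cached state is stored as a Compute Engine image or a disk, and the zone, are assumptions):

```bash
# If the cached state is a Compute Engine image (assumption):
gcloud compute images delete zebrad-cache-7633-merge-7b222f7-v25-mainnet-tip-u-174603

# If it is a disk instead (assumption), the equivalent would be:
# gcloud compute disks delete zebrad-cache-7633-merge-7b222f7-v25-mainnet-tip-u-174603 --zone=<zone>
```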

teor2345 (Contributor) commented Oct 3, 2023

Have we tried restarting CI without deleting any images?
It might have been a temporary issue on that Google Cloud machine, and we'll get a new machine when we restart.
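One way to do that from the command line, assuming the GitHub CLI is available (the run ID is taken from the link in the issue description):

```bash
# Re-run only the failed jobs of the run linked above:
gh run rerun 6382508836 --failed --repo ZcashFoundation/zebra
```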

teor2345 (Contributor) commented Oct 3, 2023

(It is unlikely that a "device or resource busy" error would be caused by a specific image, because they are usually about open files or devices.)
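If something is holding the device open, standard Linux tools on the affected runner could identify the culprit (a sketch; assumes `fuser` and `lsof` are installed on the VM):

```bash
# List processes holding the block device open:
sudo fuser -v /dev/sdb

# Alternative view of open file handles on the device:
sudo lsof /dev/sdb

# Was the device already mounted somewhere else?
grep sdb /proc/mounts
```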

oxarbitrage (Contributor, Author) commented

Makes sense. Restarting all jobs at https://github.com/ZcashFoundation/zebra/actions/runs/6382508836?pr=7653 for the PR with the issue.

teor2345 (Contributor) commented Oct 3, 2023

> On closer inspection, I've observed that the error is not present in every open pull request, but only in the one linked here.
>
> The error is visible in the Zebra Tip JSON-RPC job, while none of the other open pull requests encounter this issue in the same job or with the specific error message.

Sorry about that, I thought I had checked multiple PRs, but I might have accidentally checked the same PR multiple times.

gustavovalverde (Member) commented

This PR did not work:

teor2345 (Contributor) commented Oct 4, 2023

Could we re-run the entire docker command a limited number of times until it succeeds?
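A minimal sketch of that retry idea (the image variable, volume name, and retry count are placeholders, not the actual CI script):

```bash
#!/usr/bin/env bash
# Hypothetical retry wrapper around the failing docker step.
set -u
for attempt in 1 2 3; do
  # $IMAGE and the volume name stand in for whatever the CI job runs.
  if docker run --rm -v fully-synced-rpc-85b855b:/data "$IMAGE"; then
    exit 0
  fi
  echo "docker run failed (attempt $attempt); retrying in 30s..." >&2
  sleep 30
done
exit 125   # same exit code the failing step reported
```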

teor2345 (Contributor) commented Oct 6, 2023

This wasn't completely fixed by PR #7686, but it's a lot better now.

Maybe we can drop it down from critical to high priority?

gustavovalverde (Member) commented

I’m waiting for the latest commit to run, but I was able to find the issue while deploying the instances manually: dmesg was outputting the following message when I tried to mount /dev/sdb in Docker:

/dev/sdb: Can't open blockdev

And this only happened after creating the Docker volume.
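The sequence being described is roughly the following (a reconstruction, not the exact commands from CI; the device and volume names are assumed):

```bash
# Create a local-driver volume backed by the raw block device:
docker volume create --driver local \
  --opt type=ext4 --opt device=/dev/sdb fully-synced-rpc-test

# Using the volume then fails at mount time...
docker run --rm -v fully-synced-rpc-test:/data alpine true

# ...and the kernel log explains why:
sudo dmesg | tail   # "/dev/sdb: Can't open blockdev"
```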

I added a new commit (which I’ve tested at least 3 times), and it’s no longer failing to mount: 398c2f1

But I’ll keep testing to confirm.

teor2345 (Contributor) commented Oct 9, 2023

What if using block storage is part of our issue?
It's only recommended for experts in the docker docs:
https://docs.docker.com/storage/volumes/#block-storage-devices

Is there a way to let docker handle the devices automatically, without us having to initialise them?
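For comparison, the difference in sketch form (names assumed; the second form lets Docker manage the storage itself):

```bash
# Block-device-backed volume (current approach, per the docs link above;
# requires us to attach and format the device ourselves):
docker volume create --driver local \
  --opt type=ext4 --opt device=/dev/sdb fully-synced-rpc

# Docker-managed named volume: Docker handles the storage under
# /var/lib/docker/volumes itself, no device initialisation needed:
docker volume create fully-synced-rpc
```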

mergify bot closed this as completed in #7690 on Oct 9, 2023