
Performance improvements #453

Merged

Conversation

@henrinormak (Contributor) commented Feb 16, 2023

I think I found a few places that can be tuned to improve performance in setups with many parallel test cases. To put the numbers into context, I used one of the test sets I have access to where testcontainers is being rolled out.

The test set is fairly large, consisting of about 1300 Jest test cases (in about 180 test suites), where almost every test case touches a container. Container reuse is enabled, but each case still goes through GenericContainer and the reuse logic there (i.e. there is no in-memory resource sharing between test cases). All of the test cases are marked concurrent, and throughout my test runs I used 4 Jest workers, so 4 test suites were running in parallel.

I ran each scenario 5 times. That is probably not statistically significant, but the test set is big enough that running it more often takes too long. I discarded any run that failed (some test cases are flaky and fail occasionally); in total, fewer than 5 runs were discarded.

Here's the baseline I captured with version 9.1.3:

Run 1 - Done in 351.90s.
Run 2 - Done in 337.98s.
Run 3 - Done in 309.03s.
Run 4 - Done in 339.93s.
Run 5 - Done in 316.15s.
= Average 330.99s

After moving the auth config lookup so that it only happens when an image actually needs pulling, adding a lock around image pulling, and skipping some logic when logging is disabled (mainly determining the system statistics) [1]:

Run 1 - Done in 307.70s.
Run 2 - Done in 340.78s.
Run 3 - Done in 290.94s.
Run 4 - Done in 288.53s.
Run 5 - Done in 278.26s.
= Average 301.24s

After optimising the way the image existence check occurs [2]:

Run 1 - Done in 284.07s.
Run 2 - Done in 272.30s.
Run 3 - Done in 270.35s.
Run 4 - Done in 298.30s.
Run 5 - Done in 278.50s.
= Average 280.70s

Reasoning

[1] - The auth check does not need to happen if we decide not to pull the image. This also fixes a bug: previously runInContainer did not do the auth config lookup at all, which I assume it should have. The lock I added prevents concurrent tests from all starting to pull an image they share; this way the image is pulled only once, and the other checks then see that it already exists (see the first sketch below). As for determining the system statistics: even though it is spun off as a promise that is not awaited, I think it makes no sense to do it when logging is not enabled, which is probably the majority of cases.

[2] - Previously, every single check fetched the entire list of images from Docker, which on some machines can be quite a large set. The new approach caches the results and also applies a lock, so parallel checks for the same image no longer do the same work in parallel. When an image is not found in the cache, the full image list is fetched again and the cache refreshed (see the second sketch below).
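For illustration, here is a minimal sketch of the idea in [1]: only resolve the auth config when a pull is actually going to happen, and serialise pulls per image name so concurrent tests pull a shared image only once. It assumes the async-lock package for the lock, and imageExists/getAuthConfig are hypothetical stand-ins for the library's own helpers; this is not the exact code in this PR.

```typescript
import AsyncLock from "async-lock";
import Dockerode from "dockerode";

// Hypothetical stand-ins for the library's own helpers:
declare function imageExists(dockerode: Dockerode, imageName: string): Promise<boolean>;
declare function getAuthConfig(imageName: string): Promise<Record<string, unknown> | undefined>;

const pullLock = new AsyncLock();

export async function pullImageIfNeeded(dockerode: Dockerode, imageName: string): Promise<void> {
  // Serialise on the image name so parallel test suites sharing an image pull it only once.
  await pullLock.acquire(imageName, async () => {
    if (await imageExists(dockerode, imageName)) {
      return; // Image already present: skip both the auth lookup and the pull.
    }
    // The auth config lookup only happens once we know a pull is needed.
    const authconfig = await getAuthConfig(imageName);
    const stream = await dockerode.pull(imageName, { authconfig });
    // Wait for the pull to complete before releasing the lock.
    await new Promise<void>((resolve, reject) =>
      dockerode.modem.followProgress(stream, (err: Error | null) => (err ? reject(err) : resolve()))
    );
  });
}
```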
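And a rough sketch of the cached existence check in [2], again assuming async-lock; the names are illustrative and the real implementation differs in its details. Results are remembered in a set keyed by image tag, and a cache miss triggers one refresh of the full image list under a per-image lock.

```typescript
import AsyncLock from "async-lock";
import Dockerode from "dockerode";

const imageCheckLock = new AsyncLock();
const existingImages = new Set<string>();

export async function imageExists(dockerode: Dockerode, imageName: string): Promise<boolean> {
  // One check per image name at a time; concurrent callers share the result.
  return imageCheckLock.acquire(imageName, async () => {
    // Fast path: a previous check (or a completed pull) already saw this image.
    if (existingImages.has(imageName)) {
      return true;
    }
    // Cache miss: fetch the full image list once and cache every tag we see.
    const images = await dockerode.listImages();
    for (const image of images) {
      for (const repoTag of image.RepoTags ?? []) {
        existingImages.add(repoTag);
      }
    }
    return existingImages.has(imageName);
  });
}
```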

Followup

I suspect there is some more room for improvement. One thing I noticed is the default waiting strategy, which uses Promise.all for the internal port check commands. It could probably be improved with a modified approach using Promise.any, resolving fast when one of the commands is faster than the others and gives a positive answer (a sketch follows below). I left this out because Promise.any was introduced in Node 15, and if I understand correctly, this project still supports Node 14.
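Purely to illustrate the Promise.any idea (this is not the library's wait-strategy code; the probe commands and the execInContainer signature are made up for the example): each probe rejects on failure, so Promise.any resolves with the first probe that succeeds and only rejects, with an AggregateError, once every probe has failed.

```typescript
export async function internalPortCheck(
  execInContainer: (command: string[]) => Promise<{ exitCode: number }>,
  port: number
): Promise<boolean> {
  // Two illustrative ways of probing a port from inside the container.
  const commands = [
    ["/bin/sh", "-c", `nc -vz -w 1 localhost ${port}`],
    ["/bin/bash", "-c", `</dev/tcp/localhost/${port}`],
  ];

  try {
    await Promise.any(
      commands.map(async (command) => {
        const { exitCode } = await execInContainer(command);
        if (exitCode !== 0) {
          throw new Error(`Port ${port} not listening yet`);
        }
      })
    );
    return true; // At least one probe answered positively.
  } catch {
    return false; // All probes failed.
  }
}
```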

It is also possible that the execContainer utility could be improved to avoid polling. At least dockerode's issue on this suggests that the latest versions of Docker ship a shim that solves the original problem where stream.on('end') did not fire properly on Windows (I am not sure whether that is the reason this library uses the polling approach).
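For reference, a sketch of what an exec without polling could look like with dockerode: wait for the stream's 'end' event, then read the exit code once. It assumes a Docker version where 'end' fires reliably, and error handling is omitted.

```typescript
import Dockerode from "dockerode";
import { PassThrough } from "stream";

export async function execWithoutPolling(
  container: Dockerode.Container,
  command: string[]
): Promise<{ output: string; exitCode: number }> {
  const exec = await container.exec({ Cmd: command, AttachStdout: true, AttachStderr: true });
  const stream = await exec.start({});

  // Demultiplex stdout/stderr into a single buffer.
  let output = "";
  const sink = new PassThrough();
  sink.on("data", (chunk) => (output += chunk.toString()));
  container.modem.demuxStream(stream, sink, sink);

  // Instead of polling exec.inspect() until Running is false, wait for the
  // stream to end and then read the exit code once.
  await new Promise<void>((resolve) => stream.on("end", () => resolve()));
  const { ExitCode } = await exec.inspect();

  return { output, exitCode: ExitCode ?? -1 };
}
```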

@cristianrgreco (Collaborator)

Thanks @henrinormak, these are great findings and an excellent write-up! I'll review this shortly. I will also look into whether the polling can be dropped from the exec; it would make my day if it could be 😄

@henrinormak (Contributor, Author)

Locally on macOS with Docker version 20.10.22, build 3a2c30b, I am indeed able to refactor the polling away and instead wait for the stream to end. I don't have a Windows machine at hand to test it on.

Not sure what kind of impact (if any) it has on performance.

@cristianrgreco (Collaborator)

@henrinormak if you have that change ready I'd appreciate the PR, and I can see about getting it tested on Windows. No worries if not; I'll look into it when I can.

@henrinormak (Contributor, Author)

@cristianrgreco pushed the change to this branch

@cristianrgreco (Collaborator) left a comment

It's... beautiful 😄

@cristianrgreco added the enhancement (New feature or request), patch (Backward compatible bug fix) and minor (Backward compatible functionality) labels and removed the patch label on Feb 17, 2023
@cristianrgreco (Collaborator)

Just tested this on a Windows machine running Docker 20.10.22 and it works


```diff
 export const imageExists = async (dockerode: Dockerode, imageName: DockerImageName): Promise<boolean> => {
   log.debug(`Checking if image exists: ${imageName}`);
-  return (await listImages(dockerode)).some((image) => image.equals(imageName));
+  return imageCheckLock.acquire(imageName.toString(), async () => {
```
@cristianrgreco (Collaborator)

Is it necessary to provide a timeout here, in case something goes wrong and the lock can't be acquired? Likewise does the lock need to be released in some finally block?

@henrinormak (Contributor, Author)

The lock implementation handles the release even if the user code throws, and it also allows configuring a timeout. For now I left the timeout out, as presumably the underlying operations already have timeouts of their own.
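(For reference, a minimal sketch of how such a timeout could be configured, assuming async-lock is the lock implementation in question: acquire() rejects if the lock is not obtained within the timeout, and the lock is released automatically even when the guarded callback throws.)

```typescript
import AsyncLock from "async-lock";

// Lock with a 60s acquisition timeout; the number here is arbitrary.
const lockWithTimeout = new AsyncLock({ timeout: 60_000 });

async function guardedWork(imageName: string): Promise<void> {
  await lockWithTimeout.acquire(imageName, async () => {
    // Work that must not run concurrently for the same image; any thrown
    // error propagates, but the lock is still released.
  });
}
```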

@cristianrgreco cristianrgreco merged commit 1533445 into testcontainers:main Feb 28, 2023
@henrinormak henrinormak deleted the performance-improvements branch March 7, 2023 02:24
@cristianrgreco cristianrgreco changed the title refactor: performance improvements Performance improvements Mar 8, 2023
@henrinormak henrinormak mentioned this pull request Mar 13, 2023