-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve caching #1005
improve caching #1005
Conversation
key: gradle-${{ hashFiles('**/*.gradle') }} | ||
path: | | ||
/tmp/docker-registry | ||
key: docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-${{ github.sha }} | ||
restore-keys: | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docker can (and should) use restore-keys since some of the layers will match even if the contents change
key: gradle-${{ hashFiles('**/*.gradle') }} | ||
path: | | ||
/tmp/docker-registry | ||
key: docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-${{ github.sha }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using Dockerfile hashes even though the contents of the image might change, because this is only used to speed up builds, not to decide if it should skip a Docker build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is github.sha
intended? wouldn't this basically be discarded on every commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is intended to store the cache entries with different keys. However, restore-keys allows it to read from different keys. It makes it easier to benchmark re-running builds for Docker with almost no downside (except maybe more cache evictions since it isn't overwriting keys as much).
path: ~/.npm | ||
key: node-${{ hashFiles('**/package-lock.json') }} | ||
path: | | ||
~/.cache/pip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on how you use pip, more might be stored here. Right now this is super lightweight. It also allows restore-keys because this can be used by all venvs.
.github/workflows/gradle.yml
Outdated
with: | ||
path: | | ||
~/.npm | ||
**/node_modules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's somewhat debatable that we should have node_modules in a place that uses restore-keys, but it seems like a much smaller issue than python venvs or gradle caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't ~/.npm
effectively the cache where node_modules is first populated from? does it make a difference to include node_modules directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it isn't recommended. I'll remove.
uses: actions/cache@v2 | ||
with: | ||
path: | | ||
~/.gradle/caches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Github Action cache examples prefer to cache the specific directories for gradle instead of the entire ~/.gradle
.
This intentionally does not use restore-key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment saying this "This intentionally does not use restore-key."?
- name: Check documentation | ||
if: success() && github.ref == 'refs/heads/master' | ||
run: ./tools/site/link_checker.sh check_docs | ||
|
||
- name: Write Integration Test Credentials |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this block down to make sure that ./gradlew build
doesn't accidentally include a step that depends on credentials.
@@ -20,17 +20,15 @@ dependencies { | |||
|
|||
// on rc version due to docker pull bug in current stable version. | |||
// issue: https://github.com/airbytehq/airbyte/issues/493 | |||
testCompile "org.testcontainers:mssqlserver:1.15.0-rc2" | |||
testImplementation "org.testcontainers:mssqlserver:1.15.0-rc2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compile is deprecated
tools/bin/build_image.sh
Outdated
--iidfile "$ID_FILE" | ||
else | ||
# run build with local docker registery for CI | ||
docker pull localhost:5000/"$TAG" || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing in a local repository with a custom data dir (which is also started by Github Actions) seems like the best balance between build speed and cache speed.
key: gradle-${{ hashFiles('**/*.gradle') }} | ||
path: | | ||
/tmp/docker-registry | ||
key: docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-${{ github.sha }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is github.sha
intended? wouldn't this basically be discarded on every commit?
tools/bin/build_image.sh
Outdated
--iidfile "$ID_FILE" | ||
else | ||
# run build with local docker registery for CI | ||
docker pull localhost:5000/"$TAG" || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps a silly question but shouldn't this also include the image name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here "$TAG" is the tagged image name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed so it's more clear
.github/workflows/gradle.yml
Outdated
with: | ||
path: | | ||
~/.npm | ||
**/node_modules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't ~/.npm
effectively the cache where node_modules is first populated from? does it make a difference to include node_modules directly?
@@ -43,6 +64,19 @@ jobs: | |||
with: | |||
python-version: '3.7' | |||
|
|||
- name: Start local Docker registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this interact with the docker cache? the idea is we get access to the Docker Caching
above? i'm not sure i full understand how these two work together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This uses /tmp/docker-registry
as the data directory for a local registry. When build_image.sh
pushes data to the local registry, this is what's being cached in Docker Caching
(which creates a post-step at the end which saves the contents).
@@ -45,11 +45,17 @@ class AirbytePythonPlugin implements Plugin<Project> { | |||
project.task('installReqs', type: PythonTask) { | |||
module = "pip" | |||
command = "install -r requirements.txt" | |||
inputs.file('requirements.txt') | |||
outputs.file('build/installedreqs.txt') | |||
outputs.cacheIf { true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the implication of this? it just always caches its output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tells it that it's cacheable, but it still needs to re-run if the inputs change.
In general I would be in favor of your putting your inline github comments into the code as actual comments. |
For #991
This PR separates caching into separate steps that correspond to different keys. I go into detail on this in comments on the PR.
I tried different methods from https://dev.to/dtinth/caching-docker-builds-in-github-actions-which-approach-is-the-fastest-a-research-18ei. In our case, we have so many large images that some methods (like the save/load) simply take up more disk space than is allowed on Github Actions. What I have here looks to me like the best balance, especially for split builds.
This doesn't speed up the CI build as much as I would like. The
Build
step is cut almost in half, but there's more overhead and it doesn't catch every case. The biggest benefit is that local builds are sped up significantly. Once the builds are split we'll have a much faster iteration cycle on CI too (lower overhead and fewer edge cases to target fixing).Latest successful build: https://github.com/airbytehq/airbyte/runs/1426297335 / https://scans.gradle.com/s/4vlk24ao7da3m