Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve caching #1005

Merged
merged 38 commits into from
Nov 19, 2020
Merged

improve caching #1005

merged 38 commits into from
Nov 19, 2020

Conversation

jrhizor
Copy link
Contributor

@jrhizor jrhizor commented Nov 17, 2020

For #991

This PR separates caching into separate steps that correspond to different keys. I go into detail on this in comments on the PR.

I tried different methods from https://dev.to/dtinth/caching-docker-builds-in-github-actions-which-approach-is-the-fastest-a-research-18ei. In our case, we have so many large images that some methods (like the save/load) simply take up more disk space than is allowed on Github Actions. What I have here looks to me like the best balance, especially for split builds.

This doesn't speed up the CI build as much as I would like. The Build step is cut almost in half, but there's more overhead and it doesn't catch every case. The biggest benefit is that local builds are sped up significantly. Once the builds are split we'll have a much faster iteration cycle on CI too (lower overhead and fewer edge cases to target fixing).

Latest successful build: https://github.com/airbytehq/airbyte/runs/1426297335 / https://scans.gradle.com/s/4vlk24ao7da3m

@jrhizor jrhizor changed the title python caching improve caching Nov 19, 2020
key: gradle-${{ hashFiles('**/*.gradle') }}
path: |
/tmp/docker-registry
key: docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-${{ github.sha }}
restore-keys: |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker can (and should) use restore-keys since some of the layers will match even if the contents change

key: gradle-${{ hashFiles('**/*.gradle') }}
path: |
/tmp/docker-registry
key: docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-${{ github.sha }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using Dockerfile hashes even though the contents of the image might change, because this is only used to speed up builds, not to decide if it should skip a Docker build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is github.sha intended? wouldn't this basically be discarded on every commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is intended to store the cache entries with different keys. However, restore-keys allows it to read from different keys. It makes it easier to benchmark re-running builds for Docker with almost no downside (except maybe more cache evictions since it isn't overwriting keys as much).

path: ~/.npm
key: node-${{ hashFiles('**/package-lock.json') }}
path: |
~/.cache/pip
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how you use pip, more might be stored here. Right now this is super lightweight. It also allows restore-keys because this can be used by all venvs.

with:
path: |
~/.npm
**/node_modules
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's somewhat debatable that we should have node_modules in a place that uses restore-keys, but it seems like a much smaller issue than python venvs or gradle caching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't ~/.npm effectively the cache where node_modules is first populated from? does it make a difference to include node_modules directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it isn't recommended. I'll remove.

uses: actions/cache@v2
with:
path: |
~/.gradle/caches
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Github Action cache examples prefer to cache the specific directories for gradle instead of the entire ~/.gradle.

This intentionally does not use restore-key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment saying this "This intentionally does not use restore-key."?

- name: Check documentation
if: success() && github.ref == 'refs/heads/master'
run: ./tools/site/link_checker.sh check_docs

- name: Write Integration Test Credentials
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this block down to make sure that ./gradlew build doesn't accidentally include a step that depends on credentials.

@@ -20,17 +20,15 @@ dependencies {

// on rc version due to docker pull bug in current stable version.
// issue: https://github.com/airbytehq/airbyte/issues/493
testCompile "org.testcontainers:mssqlserver:1.15.0-rc2"
testImplementation "org.testcontainers:mssqlserver:1.15.0-rc2"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compile is deprecated

--iidfile "$ID_FILE"
else
# run build with local docker registery for CI
docker pull localhost:5000/"$TAG" || true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing in a local repository with a custom data dir (which is also started by Github Actions) seems like the best balance between build speed and cache speed.

@jrhizor jrhizor marked this pull request as ready for review November 19, 2020 19:12
key: gradle-${{ hashFiles('**/*.gradle') }}
path: |
/tmp/docker-registry
key: docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-${{ github.sha }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is github.sha intended? wouldn't this basically be discarded on every commit?

--iidfile "$ID_FILE"
else
# run build with local docker registery for CI
docker pull localhost:5000/"$TAG" || true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps a silly question but shouldn't this also include the image name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here "$TAG" is the tagged image name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed so it's more clear

with:
path: |
~/.npm
**/node_modules
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't ~/.npm effectively the cache where node_modules is first populated from? does it make a difference to include node_modules directly?

@@ -43,6 +64,19 @@ jobs:
with:
python-version: '3.7'

- name: Start local Docker registry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this interact with the docker cache? the idea is we get access to the Docker Caching above? i'm not sure i full understand how these two work together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses /tmp/docker-registry as the data directory for a local registry. When build_image.sh pushes data to the local registry, this is what's being cached in Docker Caching (which creates a post-step at the end which saves the contents).

@@ -45,11 +45,17 @@ class AirbytePythonPlugin implements Plugin<Project> {
project.task('installReqs', type: PythonTask) {
module = "pip"
command = "install -r requirements.txt"
inputs.file('requirements.txt')
outputs.file('build/installedreqs.txt')
outputs.cacheIf { true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the implication of this? it just always caches its output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells it that it's cacheable, but it still needs to re-run if the inputs change.

@cgardens
Copy link
Contributor

In general I would be in favor of your putting your inline github comments into the code as actual comments.

@jrhizor jrhizor merged commit 05404bd into master Nov 19, 2020
@jrhizor jrhizor deleted the jrhizor/python-caching branch November 19, 2020 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants