Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve caching #1005

Merged
merged 38 commits into from
Nov 19, 2020
Merged
Show file tree
Hide file tree
Changes from 36 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
8ee8b7e
python caching
jrhizor Nov 17, 2020
0935798
improve gradle caching
jrhizor Nov 17, 2020
8ba3b92
store pip cache dir
jrhizor Nov 17, 2020
76001e7
debugging info
jrhizor Nov 17, 2020
4130412
split build steps and temporarily remove testing infra from build
jrhizor Nov 17, 2020
1629280
fix formatting error
jrhizor Nov 17, 2020
aabe54d
fix another syntax error
jrhizor Nov 17, 2020
af74819
split repo caches
jrhizor Nov 18, 2020
f175856
switch pip dir
jrhizor Nov 18, 2020
215f47b
use git hash instead of rehashing everything
jrhizor Nov 18, 2020
2b58a96
syntax
jrhizor Nov 18, 2020
81a1341
list directories
jrhizor Nov 18, 2020
af805ec
list after cache population
jrhizor Nov 18, 2020
244f379
list .docker
jrhizor Nov 18, 2020
f74b1b9
specify custom output cache directory for docker
jrhizor Nov 18, 2020
3b61123
only build bigquery and use buildx on ci
jrhizor Nov 18, 2020
486540f
add build command
jrhizor Nov 18, 2020
b20b554
use docker driver instead of docker-container
jrhizor Nov 18, 2020
49f0034
use correct key
jrhizor Nov 18, 2020
7d0b938
use local registry
jrhizor Nov 18, 2020
103ad7e
allow push failure
jrhizor Nov 18, 2020
ebd2a5f
make sure to tag first
jrhizor Nov 18, 2020
99ce830
move part of node caching and restore rest of actual build
jrhizor Nov 18, 2020
25eb722
cleanup
jrhizor Nov 18, 2020
2ed835a
Merge branch 'master' into jrhizor/python-caching
jrhizor Nov 18, 2020
c505225
clean up mssql gradle file
jrhizor Nov 18, 2020
0a5f588
toggle on inline cache
jrhizor Nov 18, 2020
9da8693
try save and load method instead
jrhizor Nov 19, 2020
d5424ab
use separate tar for each image
jrhizor Nov 19, 2020
defe36b
limit to 20
jrhizor Nov 19, 2020
4fa24f7
always succeed even though disk space runs out
jrhizor Nov 19, 2020
c57b12b
try to just pull the public image to see if that's faster
jrhizor Nov 19, 2020
379b63f
use cachefrom
jrhizor Nov 19, 2020
10996a3
revert back to local registry version
jrhizor Nov 19, 2020
a9062d4
Merge branch 'master' into jrhizor/python-caching
jrhizor Nov 19, 2020
fa1d783
see how buildkit compares
jrhizor Nov 19, 2020
2b79bfb
address review comments
jrhizor Nov 19, 2020
1206351
add comments
jrhizor Nov 19, 2020
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 43 additions & 18 deletions .github/workflows/gradle.yml
Original file line number Diff line number Diff line change
Expand Up @@ -15,21 +15,43 @@ jobs:
- name: Check images exist
run: ./tools/bin/check_images_exist.sh

- name: Cache java deps
- name: Docker Caching
uses: actions/cache@v2
with:
path: ~/.gradle
key: gradle-${{ hashFiles('**/*.gradle') }}
path: |
/tmp/docker-registry
key: docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-${{ github.sha }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using Dockerfile hashes even though the contents of the image might change, because this is only used to speed up builds, not to decide if it should skip a Docker build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is github.sha intended? wouldn't this basically be discarded on every commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is intended to store the cache entries with different keys. However, restore-keys allows it to read from different keys. It makes it easier to benchmark re-running builds for Docker with almost no downside (except maybe more cache evictions since it isn't overwriting keys as much).

restore-keys: |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker can (and should) use restore-keys since some of the layers will match even if the contents change

gradle-
docker-${{ runner.os }}-${{ hashFiles('Dockerfile') }}-
docker-${{ runner.os }}-

- name: Cache node deps
- name: Pip Caching
uses: actions/cache@v2
with:
path: ~/.npm
key: node-${{ hashFiles('**/package-lock.json') }}
path: |
~/.cache/pip
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how you use pip, more might be stored here. Right now this is super lightweight. It also allows restore-keys because this can be used by all venvs.

key: pip-${{ runner.os }}-${{ hashFiles('**/setup.py') }}-${{ hashFiles('**/requirements.txt') }}
restore-keys: |
node-
pip-${{ runner.os }}-

- name: Npm Caching
uses: actions/cache@v2
with:
path: |
~/.npm
**/node_modules
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's somewhat debatable that we should have node_modules in a place that uses restore-keys, but it seems like a much smaller issue than python venvs or gradle caching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't ~/.npm effectively the cache where node_modules is first populated from? does it make a difference to include node_modules directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it isn't recommended. I'll remove.

key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
restore-keys: |
npm-${{ runner.os }}-

- name: Gradle and Python Caching
uses: actions/cache@v2
with:
path: |
~/.gradle/caches
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Github Action cache examples prefer to cache the specific directories for gradle instead of the entire ~/.gradle.

This intentionally does not use restore-key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment saying this "This intentionally does not use restore-key."?

~/.gradle/wrapper
**/.venv
key: ${{ runner.os }}-${{ hashFiles('**/*.gradle*') }}-${{ hashFiles('**/package-lock.json') }}-${{ hashFiles('**/setup.py') }}-${{ hashFiles('**/requirements.txt') }}

- uses: actions/setup-java@v1
with:
Expand All @@ -43,6 +65,19 @@ jobs:
with:
python-version: '3.7'

- name: Start local Docker registry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this interact with the docker cache? the idea is we get access to the Docker Caching above? i'm not sure i full understand how these two work together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses /tmp/docker-registry as the data directory for a local registry. When build_image.sh pushes data to the local registry, this is what's being cached in Docker Caching (which creates a post-step at the end which saves the contents).

run: docker run -d -p 5000:5000 --restart=always --name registry -v /tmp/docker-registry:/var/lib/registry registry:2 && npx wait-on tcp:5000

- name: Build
run: ./gradlew --no-daemon build --scan

- name: Ensure no file change
run: git status --porcelain && test -z "$(git status --porcelain)"

- name: Check documentation
if: success() && github.ref == 'refs/heads/master'
run: ./tools/site/link_checker.sh check_docs

- name: Write Integration Test Credentials
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this block down to make sure that ./gradlew build doesn't accidentally include a step that depends on credentials.

run: ./tools/bin/ci_credentials.sh
env:
Expand All @@ -62,16 +97,6 @@ jobs:
AWS_S3_INTEGRATION_TEST_CREDS: ${{ secrets.AWS_S3_INTEGRATION_TEST_CREDS }}
MAILCHIMP_TEST_CREDS: ${{ secrets.MAILCHIMP_TEST_CREDS }}

- name: Build
run: ./gradlew --no-daemon build --scan

- name: Ensure no file change
run: git status --porcelain && test -z "$(git status --porcelain)"

- name: Check documentation
if: success() && github.ref == 'refs/heads/master'
run: ./tools/site/link_checker.sh check_docs

- name: Run Integration Tests (PR)
if: success() && github.ref != 'refs/heads/master'
run: ./tools/bin/integration_test_pr.sh
Expand Down
4 changes: 1 addition & 3 deletions airbyte-integrations/connectors/source-mssql/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -20,17 +20,15 @@ dependencies {

// on rc version due to docker pull bug in current stable version.
// issue: https://github.com/airbytehq/airbyte/issues/493
testCompile "org.testcontainers:mssqlserver:1.15.0-rc2"
testImplementation "org.testcontainers:mssqlserver:1.15.0-rc2"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compile is deprecated


testImplementation 'org.apache.commons:commons-text:1.9'
testImplementation 'org.apache.commons:commons-lang3:3.11'
testImplementation 'org.apache.commons:commons-dbcp2:2.7.0'
testImplementation 'org.testcontainers:postgresql:1.15.0-rc2'

testImplementation project(':airbyte-test-utils')

integrationTestImplementation project(':airbyte-integrations:bases:standard-source-test')
testCompile "org.testcontainers:mssqlserver:1.15.0-rc2"

implementation files(project(':airbyte-integrations:bases:base-java').airbyteDocker.outputs)
}
6 changes: 6 additions & 0 deletions buildSrc/src/main/groovy/airbyte-python.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -45,11 +45,17 @@ class AirbytePythonPlugin implements Plugin<Project> {
project.task('installReqs', type: PythonTask) {
module = "pip"
command = "install -r requirements.txt"
inputs.file('requirements.txt')
outputs.file('build/installedreqs.txt')
outputs.cacheIf { true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the implication of this? it just always caches its output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells it that it's cacheable, but it still needs to re-run if the inputs change.

}

project.task('installTestReqs', type: PythonTask, dependsOn: project.installReqs) {
module = "pip"
command = "install .[tests]"
inputs.file('setup.py')
outputs.file('build/installedtestreqs.txt')
outputs.cacheIf { true }
}

if(project.file('unit_tests').exists()) {
Expand Down
19 changes: 18 additions & 1 deletion tools/bin/build_image.sh
Original file line number Diff line number Diff line change
Expand Up @@ -14,4 +14,21 @@ assert_root

cd "$PROJECT_DIR"

DOCKER_BUILDKIT=1 docker build -f "$DOCKERFILE" . -t "$TAG" --iidfile "$ID_FILE"
if [[ -z "$CI" ]]; then
# run standard build locally (not on CI)
DOCKER_BUILDKIT=1 docker build \
-f "$DOCKERFILE" . \
-t "$TAG" \
--iidfile "$ID_FILE"
else
# run build with local docker registery for CI
docker pull localhost:5000/"$TAG" || true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing in a local repository with a custom data dir (which is also started by Github Actions) seems like the best balance between build speed and cache speed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps a silly question but shouldn't this also include the image name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here "$TAG" is the tagged image name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed so it's more clear

DOCKER_BUILDKIT=1 docker build \
-f "$DOCKERFILE" . \
-t "$TAG" \
--iidfile "$ID_FILE" \
--cache-from localhost:5000/"$TAG" \
--build-arg BUILDKIT_INLINE_CACHE=1
docker tag "$TAG" localhost:5000/"$TAG"
docker push localhost:5000/"$TAG"
fi