Use existing build output #94

@chrisammon3000 chrisammon3000 commented Sep 4, 2023

Description

  • Fixed errors that prevented some releases from building (340, 3130)
  • Upgraded the state machine pipeline with an option to reuse build artifacts from previous executions when they are available (use_existing_build=true)
  • Refactored the validation step to run inside the build stage on a single release, instead of running after the build on multiple releases
  • Added an option to skip the load process when only the build artifacts are needed (skip_load=true)
  • Improved error handling during build
    • Exit code 1 indicates a critical failure and causes the build to fail
    • Exit code 2 indicates a non-critical failure, for example when some alleles fail to build; the build can still succeed
  • Errors during build are written to <data_bucket>/data/<release>/errors/errors.ndjson for later analysis
    • The Failed Alleles queue is removed for now because it does not support debugging as well as the error output does
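
As a rough illustration of the exit code convention described above, a caller could map the build's exit status to an outcome like this. The function name `classify_build_status` and the outcome labels are hypothetical, and treating exit code 0 as success is an assumption based on the usual shell convention, not something stated in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a build exit code into an outcome label.
# Only codes 1 and 2 come from the PR description; 0 = success is assumed.
classify_build_status() {
  case "$1" in
    0) echo "success" ;;           # assumed: clean build
    1) echo "critical-failure" ;;  # critical failure, the build fails
    2) echo "partial-success" ;;   # non-critical failure, the build can still succeed
    *) echo "unknown" ;;           # anything else is not covered by the PR
  esac
}

# Usage: capture the status of a build command without aborting the script.
# `( exit 2 )` stands in for a real build step that exits with code 2.
status=0
( exit 2 ) || status=$?
echo "build outcome: $(classify_build_status "$status")"
```

Keeping the mapping in one place means a caller only has to decide what to do with each label, for example continuing to the load stage on `partial-success` while still inspecting errors.ndjson.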

Usage

use_existing_build=true looks for existing CSV files for each release and loads them. If a release has no CSVs, they are created.

skip_load=true runs only the build stage and skips loading. This is useful when only the CSVs are needed.
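
How the two flags might combine for each release can be sketched as below. This is only an illustration of the behavior described above, not the pipeline's actual code: `plan_release`, the local directory standing in for the data bucket, and the file name `alleles.csv` are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide which stages run for one release.
# Arguments: <release> <csv_dir> <use_existing_build> <skip_load>
plan_release() {
  local release="$1" csv_dir="$2" use_existing="${3:-false}" skip_load="${4:-false}"
  local plan
  # Reuse artifacts only when the flag is set AND at least one CSV exists;
  # otherwise the release has to be built.
  if [ "$use_existing" = "true" ] && ls "$csv_dir"/*.csv >/dev/null 2>&1; then
    plan="reuse"
  else
    plan="build"
  fi
  # Loading is appended unless skip_load=true.
  if [ "$skip_load" != "true" ]; then
    plan="$plan+load"
  fi
  echo "$release: $plan"
}

# Usage: a temporary directory stands in for the data bucket (assumption).
tmp="$(mktemp -d)"
mkdir -p "$tmp/3500" "$tmp/3510"
touch "$tmp/3510/alleles.csv"   # pretend 3510 was built by an earlier run
for r in 3500 3510; do
  plan_release "$r" "$tmp/$r" true false
done
rm -rf "$tmp"
```

With these inputs only 3510 has a CSV, so it is planned as a reuse while 3500 is planned as a fresh build.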

# Example for single version
STAGE=<stage> make database.load.run releases="3510"

# Example for multiple versions where only 3510 has already been built
# 3490 and 3500 will be built, 3510 will use existing CSVs
STAGE=<stage> make database.load.run \
    releases="3490,3500,3510" \
    use_existing_build=true

# Example of how to build all releases and skip loading
STAGE=dev make database.load.run \
    releases="300,310,320,330,340,350,360,370,380,390,3100,3110,3120,3130,3140,3150,3160,3170,3180,3190,3200,3210,3220,3230,3240,3250,3260,3270,3280,3290,3300,3310,3320,3330,3340,3350,3360,3370,3380,3390,3400,3410,3420,3430,3440,3450,3460,3470,3480,3490,3500,3510,3520,3530" \
    skip_load=true
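
The release list in the last example follows a regular pattern (300 to 390 and 3100 to 3530, in steps of 10), so it could be generated instead of typed by hand. The sketch below assumes `seq` and `paste` are available, as they are on most Linux and macOS systems. The variable name `releases` simply mirrors the make argument.

```shell
#!/usr/bin/env bash
# Build the comma-separated release list from the two numeric ranges
# that appear in the example above.
releases="$( { seq 300 10 390; seq 3100 10 3530; } | paste -sd, - )"
echo "$releases"

# The list can then be passed to the same target as in the example:
#   STAGE=dev make database.load.run releases="$releases" skip_load=true
```

If a release in either range should be excluded, filtering the `seq` output (for example with `grep -v`) before joining keeps the command short.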

Next Steps

@chrisammon3000 chrisammon3000 linked an issue Sep 4, 2023 that may be closed by this pull request
@chrisammon3000 chrisammon3000 self-assigned this Sep 4, 2023
@pbashyal-nmdp pbashyal-nmdp left a comment

👍

@pbashyal-nmdp pbashyal-nmdp merged commit 1344602 into nmdp-bioinformatics:incremental_load Sep 7, 2023
Development

Successfully merging this pull request may close these issues.

Use previous data for build pipeline if it already exists
2 participants