The purpose of `pull_date.txt` is to provide an easy way to force the pipeline to re-download everything. However, when it is changed, only `lake_metadata.csv` is downloaded again; no zip files are re-downloaded.

The cause of this bug is related to the use of multiple checkpoints when one of the checkpoints has a directory as an output. The zip files are inputs to the checkpoint `unzip_archive`. I created a MWE to illustrate and isolate the problem:
```python
import os
from zipfile import ZipFile


def get_outputs(wildcards):
    archive_file = checkpoints.get_archive_list.get().output[0]
    # archive_file = 'archives.txt'
    with open(archive_file, 'r') as f:
        archives_lines = f.read().splitlines()
    archives = [line for line in archives_lines]
    suffixes = ['1', '2']
    return [f"out/{archive}/{archive}{suffix}.txt" for archive in archives for suffix in suffixes]


rule all:
    input: get_outputs


checkpoint get_archive_list:
    input: "date_created.txt"
    output: "archives.txt"
    shell: "echo 'a' > {output}; echo 'b' >> {output}"


rule get_zip_file:
    input: "date_created.txt"
    output: "zip/{archive}.zip"
    run:
        with ZipFile(output[0], 'w') as zf:
            # Add multiple files to the zip archive
            zf.writestr(wildcards.archive + '1.txt', wildcards.archive + '1 text')
            zf.writestr(wildcards.archive + '2.txt', wildcards.archive + '2 text')


checkpoint unzip_archive:
    input: "zip/{archive}.zip"
    output: directory("data/{archive,[^/]+}")
    shell: "unzip {input} -d {output}"


def data_file(wildcards):
    # Trigger checkpoint to unzip data file
    data_file_directory = checkpoints.unzip_archive.get(archive=wildcards.archive).output[0]
    return os.path.join(data_file_directory, wildcards.filename)


rule process_data_file:
    input: data_file
    output: "out/{archive}/{filename}"
    shell: "cp {input} {output}"
```
Execute this pipeline; everything behaves as expected.

```bash
snakemake -c1
```
However, if you force re-execution of `zip/a.zip`, the dependent jobs do not re-execute.

```bash
snakemake -c1 -R zip/a.zip
```
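(Not part of the original report, just a way to observe the problem using only standard Snakemake flags.) A dry run with the same force option shows which jobs would be scheduled; if the bug is present, the plan lists the forced `get_zip_file` job but none of the downstream `unzip_archive`/`process_data_file` jobs.

```bash
# Dry run (-n) while forcing zip/a.zip (-R): lists the jobs Snakemake
# would schedule without actually running them.
snakemake -c1 -n -R zip/a.zip
```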
In `get_outputs`, if you bypass checkpoint `get_archive_list` by replacing `archive_file = checkpoints.get_archive_list.get().output[0]` with `archive_file = 'archives.txt'`, then re-executing `zip/a.zip` does cause the downstream rules to re-execute. Also, a similar workflow with two checkpoints, but without a directory as an output, behaves as expected.
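For reference, the bypassed version of `get_outputs` is simply the MWE function with the commented-out line swapped in (nothing new beyond that):

```python
def get_outputs(wildcards):
    # Read archives.txt directly instead of going through the checkpoint;
    # with this version, forcing zip/a.zip re-runs the downstream rules.
    archive_file = 'archives.txt'
    with open(archive_file, 'r') as f:
        archives = f.read().splitlines()
    suffixes = ['1', '2']
    return [f"out/{archive}/{archive}{suffix}.txt"
            for archive in archives for suffix in suffixes]
```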