Changing pull_date.txt does not trigger download #15

Open
AndyMcAliley opened this issue Mar 17, 2022 · 0 comments
The purpose of pull_date.txt is to provide an easy way to force the pipeline to re-download everything. However, when it is changed, only lake_metadata.csv is downloaded again; none of the zip files are re-downloaded.

The cause of this bug is related to the use of multiple checkpoints where one of the checkpoints has a directory as an output. The zip files are inputs to the checkpoint unzip_archive. I created an MWE to illustrate and isolate the problem:

import os
from zipfile import ZipFile

def get_outputs(wildcards):
    archive_file = checkpoints.get_archive_list.get().output[0]
    # archive_file = 'archives.txt'
    with open(archive_file, 'r') as f:
        archives = f.read().splitlines()
    suffixes = ['1', '2']
    return [f"out/{archive}/{archive}{suffix}.txt" for archive in archives for suffix in suffixes]

rule all:
    input: get_outputs

checkpoint get_archive_list:
    input: "date_created.txt"
    output: "archives.txt"
    shell: "echo 'a' > {output}; echo 'b' >> {output}"

rule get_zip_file:
    input: "date_created.txt"
    output: "zip/{archive}.zip"
    run: 
        with ZipFile(output[0], 'w') as zf: 
            # Add multiple files to the zip archive
            zf.writestr(wildcards.archive + '1.txt', wildcards.archive + '1 text')
            zf.writestr(wildcards.archive + '2.txt', wildcards.archive + '2 text')

checkpoint unzip_archive:
    input: "zip/{archive}.zip"
    output: directory("data/{archive,[^/]+}")
    shell: "unzip {input} -d {output}"

def data_file(wildcards):
    # Trigger checkpoint to unzip data file
    data_file_directory = checkpoints.unzip_archive.get(archive=wildcards.archive).output[0]
    return os.path.join(data_file_directory, wildcards.filename)

rule process_data_file:
    input: data_file
    output: "out/{archive}/{filename}"
    shell: "cp {input} {output}"

Execute this pipeline; everything behaves as expected.

snakemake -c1

But force the re-execution of zip/a.zip, and the dependent jobs do not re-execute.

snakemake -c1 -R zip/a.zip

In get_outputs, if you bypass checkpoint get_archive_list by replacing

archive_file = checkpoints.get_archive_list.get().output[0]

with

archive_file = 'archives.txt'

then re-executing zip/a.zip does cause downstream rules to re-execute. Also, a similar workflow with two checkpoints, but without a directory as an output, behaves as expected.
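
For reference, the bypassed version of get_outputs looks like this (only the first line changes: archives.txt is read directly instead of being obtained from the checkpoint, so no checkpoint lookup happens in this input function):

def get_outputs(wildcards):
    # Bypass checkpoint get_archive_list: read archives.txt directly
    # rather than querying the checkpoint for its output file.
    archive_file = 'archives.txt'
    with open(archive_file, 'r') as f:
        archives = f.read().splitlines()
    suffixes = ['1', '2']
    return [f"out/{archive}/{archive}{suffix}.txt" for archive in archives for suffix in suffixes]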
