Allow build-specific --narrow-bandwidth param in frequencies #1130

Merged
merged 3 commits into from
Oct 3, 2024
21 changes: 16 additions & 5 deletions defaults/parameters.yaml
Expand Up @@ -137,18 +137,29 @@ ancestral:
# Frequencies settings
frequencies:

# min_date is set by default to 1 year before present
# but can be explicitly set if desired
# default settings that can be over-ridden for specific builds
default:

# min_date is set by default to 1 year before present
min_date: "1Y"

# max_date is set by default to present date - recent_days_to_censor

# KDE bandwidths in proportion of a year to use per strain.
# using 1M bandwidth by default
narrow_bandwidth: 0.0833


# settings that can be over-ridden across all builds, but not for specific builds
recent_days_to_censor: 0
Member

The default value is now defined in two places: the config file (where this comment is added) and in the workflow:

recent_days_to_censor = config.get("frequencies", {}).get("recent_days_to_censor", 0)

Since all workflow invocations¹ source from defaults/parameters.yaml, the default value defined in common.smk never gets used. I suggest removing it:

recent_days_to_censor = config["frequencies"]["recent_days_to_censor"]

¹ under expected usage
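
The difference between the two access styles in this comment can be sketched in plain Python. This is a minimal illustration only: the dictionaries below are hypothetical stand-ins for Snakemake's `config` object, and the function names are made up for the example rather than taken from the workflow.

```python
# Hypothetical stand-ins for the Snakemake `config` dict (not the real ncov config).
config_with_key = {"frequencies": {"recent_days_to_censor": 7}}
config_without_key = {"frequencies": {}}


def censor_days_with_fallback(config):
    # Fallback style: a second, hidden default (0) lives in the workflow code.
    return config.get("frequencies", {}).get("recent_days_to_censor", 0)


def censor_days_direct(config):
    # Direct style: the config file is the only source of the default,
    # so a missing key raises KeyError instead of being silently filled in.
    return config["frequencies"]["recent_days_to_censor"]


print(censor_days_with_fallback(config_with_key))     # 7
print(censor_days_with_fallback(config_without_key))  # 0

try:
    censor_days_direct(config_without_key)
except KeyError as err:
    print(f"missing config key: {err}")
```

With direct access, a configuration that omits the key fails loudly, which is the trade-off discussed in the replies below.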

Member Author

In general, I liked having workflow default in addition to the parameters.yaml default. As you point out this redundancy also exists for frequencies.narrow_bandwidth.

But perhaps I understand your concern: having multiple layers of "defaults" might add confusion. It would be nice if people could look just at parameters.yaml to understand defaults rather than digging into the workflow.

I think the original push for adding workflow redundancy here was that parameters.yaml (still) doesn't include recent_days_to_censor, and if we start looking for it in the workflow while people are still using old parameters.yaml files, then things will break.

I think we need a broader pattern to adhere to here (kind of like the conversion in issue #1131). I'm not immediately sure what's best for this specific PR.

Member

> if we start looking for it in the workflow and people are still using old parameters.yaml files then things will break.

I'm not familiar with external usage of ncov so I appreciate the thoughts here.

Since parameters.yaml is under version control, shouldn't it be sufficient to have workflow changes + parameters.yaml changes (adding recent_days_to_censor) in the same PR? If someone pulls the workflow changes, they'll also pull the default recent_days_to_censor in parameters.yaml. I can think of a few possibilities where breakage might happen:

  1. User has made custom changes to parameters.yaml. git pull will try to auto-merge and flag any merge conflicts.

  2. User has explicitly removed this line from Snakefile:

    ncov/Snakefile

    Line 42 in 82ca8d4

    configfile: "defaults/parameters.yaml"

  3. User is using a profile that does not source from parameters.yaml, i.e. missing this line:

    configfile:
    - defaults/parameters.yaml

    This seems the most likely for external usage, and I believe it's one of the reasons why we now recommend --configfile over --profile.

Member

> I think we need a broader pattern to adhere to here

The broader pattern seems to be having a default in parameters.yaml + direct access in Snakemake. This is how it's done for other config:

pivot_interval = config["frequencies"]["pivot_interval"],
pivot_interval_units = config["frequencies"]["pivot_interval_units"],
narrow_bandwidth = config["frequencies"]["narrow_bandwidth"],
proportion_wide = config["frequencies"]["proportion_wide"]

# Number of weeks between pivots
pivot_interval: 1
# Measure pivots in weeks
pivot_interval_units: "weeks"
# KDE bandwidths in proportion of a year to use per strain.
# using 15 day bandwidth
narrow_bandwidth: 0.041
proportion_wide: 0.0

Member Author

Thanks for the thoughts. I agree with your logic. Though I would like to push the fixes to a separate PR.


# Number of weeks between pivots
pivot_interval: 1

# Measure pivots in weeks
pivot_interval_units: "weeks"

# KDE bandwidths in proportion of a year to use per strain.
# using 15 day bandwidth
narrow_bandwidth: 0.041
# Weight of KDE that uses wide bandwidth
proportion_wide: 0.0

# Diffusion frequency settings
23 changes: 20 additions & 3 deletions docs/src/reference/workflow-config-file.rst
Expand Up @@ -983,13 +983,30 @@ columns

frequencies
-----------
- Valid attributes:
- type: object
- description: Parameters for specifying tip frequency calculations via ``augur frequencies``
- examples:

.. code:: yaml

frequencies:
pivot_interval_units: "weeks"
default:
min_date: "6M"
narrow_bandwidth: 0.038
global_1m:
min_date: "1M"
narrow_bandwidth: 0.019
global_2020_to_2022:
min_date: "2020-01-01"
max_date: "2022-01-01"
narrow_bandwidth: 0.076

Each named traits configuration (``default`` or build-named) supports specification of ``min_date``, ``max_date`` and ``narrow_bandwidth``. Other parameters can only be specified across all builds.
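
As a rough illustration of the lookup order this paragraph describes, the sketch below resolves a per-build value by checking the build-named entry first and falling back to ``default``. The helper name `get_frequency_param` is hypothetical and is not one of the workflow's actual functions; the config dictionary simply mirrors the documentation example above.

```python
# Config mirroring the documentation example above, as a plain Python dict.
config = {
    "frequencies": {
        "pivot_interval_units": "weeks",
        "default": {"min_date": "6M", "narrow_bandwidth": 0.038},
        "global_1m": {"min_date": "1M", "narrow_bandwidth": 0.019},
        "global_2020_to_2022": {
            "min_date": "2020-01-01",
            "max_date": "2022-01-01",
            "narrow_bandwidth": 0.076,
        },
    }
}


def get_frequency_param(config, build_name, key):
    """Hypothetical helper: build-specific value first, then `default`, else None."""
    frequencies = config.get("frequencies", {})
    build_settings = frequencies.get(build_name, {})
    # Guard against a build name that collides with a non-mapping setting.
    if not isinstance(build_settings, dict):
        build_settings = {}
    if key in build_settings:
        return build_settings[key]
    return frequencies.get("default", {}).get(key)


print(get_frequency_param(config, "global_1m", "narrow_bandwidth"))  # 0.019
print(get_frequency_param(config, "europe_6m", "min_date"))          # 6M
print(get_frequency_param(config, "global_1m", "max_date"))          # None
```

A build without its own entry (here `europe_6m`) picks up the ``default`` values, and a key absent from both levels comes back as `None` so the caller can apply whatever behavior the workflow defines for an unset value.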
Member

(non-blocking)

Would it be worth allowing build-specific recent_days_to_censor, pivot_interval, etc. as well? If done right, this could reduce the duplication of functions _get_min_date_for_frequencies/_get_max_date_for_frequencies/_get_narrow_bandwidth_for_wildcards.

Obviously beyond the scope of this PR – I could take this or at least make an issue to start it off.

Member Author

@trvrb Oct 1, 2024

Totally. Ideally, every entry in parameters.yaml could be overridden by build-specific entries in builds.yaml. As you say, however, the current Snakemake workflow strategy would require padding out individual functions for each parameter. It would be wonderful to have a generic strategy that makes for automatic overrides.

I think maybe more important than making this work for ncov is thinking through how to do this right across pathogens. Everyone requests the ncov style parameter-overrides (this came up for mpox most obviously). cc @joverlee521
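
One possible shape for a generic override strategy is a recursive merge of build-specific entries over the workflow-wide defaults. The sketch below is only an assumption about how this could look: the `deep_merge` and `params_for_build` helpers and the layout of `build_overrides` are hypothetical and do not reflect how ncov or any other pathogen workflow is currently structured.

```python
import copy


def deep_merge(base, override):
    """Return a new dict in which values from `override` take precedence over `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested mappings key by key instead of replacing them wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical workflow-wide defaults, as parameters.yaml might provide them.
defaults = {
    "frequencies": {
        "min_date": "1Y",
        "narrow_bandwidth": 0.0833,
        "recent_days_to_censor": 0,
        "pivot_interval": 1,
    }
}

# Hypothetical per-build overrides keyed by build name (assumed layout).
build_overrides = {
    "global_1m": {"frequencies": {"min_date": "1M", "narrow_bandwidth": 0.019}},
}


def params_for_build(build_name):
    """Resolve the full parameter set for one build."""
    return deep_merge(defaults, build_overrides.get(build_name, {}))


resolved = params_for_build("global_1m")
print(resolved["frequencies"]["narrow_bandwidth"])  # 0.019
print(resolved["frequencies"]["pivot_interval"])    # 1
```

Under this approach any parameter becomes overridable without a dedicated lookup function per parameter, at the cost of deciding up front where build-specific entries live in the config.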

Contributor

There have definitely been a lot of discussions around build configs outside of ncov. I'll take some time to wrangle previous discussions and write up a summary outside of this PR (probably in pathogen-repo-guide).


.. contents::
:local:

.. _min_date-1:

min_date
~~~~~~~~

59 changes: 3 additions & 56 deletions nextstrain_profiles/nextstrain-gisaid-21L/builds.yaml
Expand Up @@ -896,115 +896,62 @@ traits:
# narrow_bandwidth = 0.019 or 7 days for "1m" and "2m"
# narrow_bandwidth = 0.038 or 14 days for "6m" and "all-time"
frequencies:
default:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
global_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
global_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
global_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
global_all-time:
min_date: "2022-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
africa_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
africa_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
africa_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
africa_all-time:
min_date: "2022-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
asia_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
asia_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
asia_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
asia_all-time:
min_date: "2022-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
europe_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
europe_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
europe_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
europe_all-time:
min_date: "2022-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
north-america_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
north-america_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
north-america_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
north-america_all-time:
min_date: "2022-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
oceania_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
oceania_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
oceania_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
oceania_all-time:
min_date: "2022-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
south-america_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
south-america_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
south-america_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
south-america_all-time:
min_date: "2022-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
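
The comments at the top of this `frequencies` block state bandwidths both in days and as a proportion of a year. The arithmetic behind those values can be checked with a short sketch, assuming a 365-day year (the divisor is an assumption; the source does not state which year length was used).

```python
DAYS_PER_YEAR = 365  # assumption: non-leap year


def days_to_year_fraction(days):
    """Convert a KDE bandwidth in days to a proportion of a year."""
    return days / DAYS_PER_YEAR


# 7-day and 14-day bandwidths from the builds.yaml comments,
# and the 15-day bandwidth from the earlier parameters.yaml default.
for days in (7, 14, 15):
    print(days, round(days_to_year_fraction(days), 3))
# 7 0.019
# 14 0.038
# 15 0.041

# The "1M" default in parameters.yaml corresponds to one twelfth of a year.
print(round(1 / 12, 4))  # 0.0833
```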
59 changes: 1 addition & 58 deletions nextstrain_profiles/nextstrain-gisaid/builds.yaml
Expand Up @@ -887,119 +887,62 @@ traits:
# narrow_bandwidth = 0.019 or 7 days for "1m" and "2m"
# narrow_bandwidth = 0.038 or 14 days for "6m" and "all-time"
frequencies:
reference:
default:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
global_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
global_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
global_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
global_all-time:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
africa_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
africa_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
africa_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
africa_all-time:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
asia_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
asia_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
asia_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
asia_all-time:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
europe_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
europe_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
europe_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
europe_all-time:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
north-america_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
north-america_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
north-america_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
north-america_all-time:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
oceania_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
oceania_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
oceania_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
oceania_all-time:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
south-america_1m:
min_date: "1M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
south-america_2m:
min_date: "2M"
narrow_bandwidth: 0.019
recent_days_to_censor: 7
south-america_6m:
min_date: "6M"
narrow_bandwidth: 0.038
recent_days_to_censor: 7
south-america_all-time:
min_date: "2020-01-01"
narrow_bandwidth: 0.038
recent_days_to_censor: 7