Persephone #28

Merged Jan 19, 2024 (28 commits)
Changes from 23 commits
Commits
951d6b5
Temporary copy ingest from mpox repo until using pathogen_template_repo
j23414 Nov 13, 2023
0496748
git subrepo clone (merge) https://github.com/nextstrain/ingest ingest…
j23414 Nov 13, 2023
38007bb
Replace mpox text and taxon id with zika
j23414 Nov 13, 2023
c475361
Remove Nextclade related rules
j23414 Nov 13, 2023
d154a88
Clear pathogen specific user provided annotations and rules
j23414 Nov 13, 2023
3dc55b9
NCBI Dataset field name transformations
j23414 Nov 13, 2023
17e3912
Rescue fauna data processing steps that are specific to Zika
j23414 Nov 13, 2023
6bdee23
Ignore snakemake state dir for current and subfolders
j23414 Nov 13, 2023
41c902f
Use genbank_accession column as ID column
j23414 Nov 13, 2023
7b37112
workaround to get accession links to work
j23414 Nov 15, 2023
9071f02
Move phylogenetic workflow to a phylogenetic folder
j23414 Nov 17, 2023
acd7605
Add rules for merging USVI data with NCBI GenBank ingested data.
j23414 Nov 21, 2023
11123b9
Move rules for preparing sequences to its own smk file
j23414 Dec 18, 2023
59ef926
Move rules for constructing phylogeny to its own smk file
j23414 Dec 18, 2023
4467646
Move rules for annotating phylogeny to its own smk file
j23414 Dec 18, 2023
6976946
Move rules for exporting auspice json to its own smk file
j23414 Dec 18, 2023
18e0b9b
ingest: consolidate source-data with config
j23414 Dec 18, 2023
3124b18
ingest: Always provide default config values
j23414 Dec 18, 2023
f388525
Use a rules folder for ingest to follow pathogen-repo-template
j23414 Jan 10, 2024
efe11e3
Update the CI workflow
j23414 Jan 10, 2024
6349fd7
Define input paths with literal path strings
j23414 Jan 11, 2024
44825a3
Copy contributing docs from mpox
j23414 Jan 12, 2024
aefdec1
Simplify README instructions
j23414 Jan 12, 2024
b734c78
Refactor post-processing script to be specific to zika strain name fixes
j23414 Jan 18, 2024
a99af29
Abbreviate authors inplace instead of a separate field
j23414 Jan 18, 2024
230c6b0
As discussed, always use the default config
j23414 Jan 18, 2024
8cf7326
Update Contributing docs
j23414 Jan 18, 2024
0560686
fixup: add docstring
j23414 Jan 18, 2024
23 changes: 21 additions & 2 deletions .github/workflows/ci.yaml
@@ -5,5 +5,24 @@ on:
   - pull_request

 jobs:
-  ci:
-    uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@master
+  pathogen-ci:
+    strategy:
+      matrix:
+        runtime: [docker, conda]
+    permissions:
+      id-token: write
+    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
+    secrets: inherit
+    with:
+      runtime: ${{ matrix.runtime }}
+      run: |
+        nextstrain build \
+          phylogenetic \
+            --configfile profiles/ci/profiles_config.yaml
+      artifact-name: output-${{ matrix.runtime }}
+      artifact-paths: |
+        phylogenetic/auspice/
+        phylogenetic/results/
+        phylogenetic/benchmarks/
+        phylogenetic/logs/
+        phylogenetic/.snakemake/log/
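
To try the CI job locally, the matrix in the new `pathogen-ci` job can be unrolled by hand. This is a minimal sketch, not the reusable workflow's actual implementation. It assumes the Nextstrain CLI is installed and that its `--docker` and `--conda` runtime options can stand in for the workflow's `runtime` input; the `ci_command` helper name is hypothetical. The script only prints the commands so they can be inspected before anything is run.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the build command for one runtime from the matrix.
# Assumption: `--docker` / `--conda` select the runtime locally; the reusable
# workflow may configure the runtime in a different way.
ci_command() {
    local runtime="$1"
    printf 'nextstrain build --%s phylogenetic --configfile profiles/ci/profiles_config.yaml\n' "$runtime"
}

# Same values as `matrix.runtime` in ci.yaml.
for runtime in docker conda; do
    # Each matrix entry uploads its own artifact, named after the runtime.
    echo "artifact: output-${runtime}"
    ci_command "$runtime"
done
```

Printing instead of executing keeps the sketch safe to run on a machine without Docker or Conda set up; copy the printed line for the runtime you have available.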
4 changes: 3 additions & 1 deletion .gitignore
@@ -9,7 +9,9 @@ build/
 environment*

 # Snakemake state dir
-/.snakemake
+.snakemake/
+benchmarks/
+logs/

 # Local config overrides
 /config_local.yaml
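
The switch from `/.snakemake` to `.snakemake/` matters because a leading slash anchors a pattern to the directory holding the `.gitignore`, while a pattern with only a trailing slash matches a directory of that name at any depth. The sketch below checks this in a throwaway repository. It assumes `git` is available; the file name `state` and the helper `check_ignored` are hypothetical and used only for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch repository so the check never touches a real checkout.
repo="$(mktemp -d)"
trap 'rm -rf "$repo"' EXIT
git init --quiet "$repo"
cd "$repo"

# Hypothetical layout: Snakemake state dirs at the root and in a workflow folder.
mkdir -p .snakemake phylogenetic/.snakemake
touch .snakemake/state phylogenetic/.snakemake/state

# Hypothetical helper: `git check-ignore` exits 0 when the path is ignored.
check_ignored() {
    if git check-ignore --quiet "$1"; then
        echo "ignored"
    else
        echo "not ignored"
    fi
}

# Old pattern: anchored to the repository root.
printf '/.snakemake\n' > .gitignore
old_nested="$(check_ignored phylogenetic/.snakemake/state)"

# New pattern: matches a .snakemake directory at any depth.
printf '.snakemake/\n' > .gitignore
new_nested="$(check_ignored phylogenetic/.snakemake/state)"
new_root="$(check_ignored .snakemake/state)"

echo "old pattern, nested dir: ${old_nested}"
echo "new pattern, nested dir: ${new_nested}"
echo "new pattern, root dir:   ${new_root}"
```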
50 changes: 50 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,50 @@
# Developer guide

## CI

Checks are automatically run on certain pushed commits for testing and linting
purposes. Some are defined by [.github/workflows/ci.yaml][] while others are
configured outside of this repository.

[.github/workflows/ci.yaml]: ./.github/workflows/ci.yaml

## Pre-commit

[pre-commit][] is used for various checks (see [configuration][]).

You can either [install it yourself][] to catch issues before pushing or look
for the [pre-commit.ci run][] after pushing.

[pre-commit]: https://pre-commit.com/
[configuration]: ./.pre-commit-config.yaml
[install it yourself]: https://pre-commit.com/#install
[pre-commit.ci run]: https://results.pre-commit.ci/repo/github/493877605

## Snakemake formatting

We use [`snakefmt`](https://github.com/snakemake/snakefmt) to ensure consistency in style across Snakemake files in this project.

### Installing

- Using mamba/bioconda:

```bash
mamba install -c bioconda snakefmt
```

- Using pip:

```bash
pip install snakefmt
```

### IDE-independent

1. Check for styling issues with `snakefmt --check .`
1. Automatically fix styling issues with `snakefmt .`

### Using VSCode extension

1. Install the [VSCode extension](https://marketplace.visualstudio.com/items?itemName=tfehlmann.snakefmt)
1. Check for styling issues with `Ctrl+Shift+P` and select `snakefmt: Check`
1. Automatically fix styling issues with `Ctrl+Shift+P` and select `Format document`
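
The `snakefmt --check .` step from the IDE-independent section can be wrapped in a small shell function for use before pushing. This is a sketch under stated assumptions, not part of the repository: the function name `check_snakefmt` is hypothetical, and it relies on `snakefmt --check` returning a non-zero exit status when files would be reformatted.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run the snakefmt check on a path (default: current directory).
# Returns 127 if snakefmt is not installed, otherwise snakefmt's own exit status.
check_snakefmt() {
    local target="${1:-.}"
    if ! command -v snakefmt >/dev/null 2>&1; then
        echo "snakefmt not found; install it with mamba or pip first" >&2
        return 127
    fi
    snakefmt --check "$target"
}
```

After sourcing the function, `check_snakefmt phylogenetic && git push` would push only when the check passes.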
90 changes: 7 additions & 83 deletions README.md
@@ -1,88 +1,12 @@
-# nextstrain.org/zika
+# Nextstrain repository for Zika virus

-This is the [Nextstrain](https://nextstrain.org) build for Zika, visible at
-[nextstrain.org/zika](https://nextstrain.org/zika).
+This repository contains two workflows for the analysis of Zika virus data:

-The build encompasses fetching data, preparing it for analysis, doing quality
-control, performing analyses, and saving the results in a format suitable for
-visualization (with [auspice][]). This involves running components of
-Nextstrain such as [fauna][] and [augur][].
+- [`ingest/`](./ingest) - Download data from GenBank, clean and curate it and upload it to S3
+- [`phylogenetic/`](./phylogenetic) - Make phylogenetic trees for nextstrain.org

-All Zika-specific steps and functionality for the Nextstrain pipeline should be
-housed in this repository.
+Each folder contains a README.md with more information.

-_This build requires Augur v6._
+## Documentation

-[![Build Status](https://github.com/nextstrain/zika/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/nextstrain/zika/actions/workflows/ci.yaml)
-
-## Usage
-
-If you're unfamiliar with Nextstrain builds, you may want to follow our
-[quickstart guide][] first and then come back here.
-
-There are two main ways to run & visualise the output from this build:
-
-The first, and easiest, way to run this pathogen build is using the [Nextstrain
-command-line tool][nextstrain-cli]:
-```
-nextstrain build .
-nextstrain view auspice/
-```
-
-See the [nextstrain-cli README][] for how to install the `nextstrain` command.
-
-The second is to install augur & auspice using conda, following [these instructions](https://nextstrain.org/docs/getting-started/local-installation#install-augur--auspice-with-conda-recommended).
-The build may then be run via:
-```
-snakemake
-auspice --datasetDir auspice/
-```
-
-Build output goes into the directories `data/`, `results/` and `auspice/`.
-
-## Configuration
-
-Configuration takes place entirely with the `Snakefile`. This can be read top-to-bottom, each rule
-specifies its file inputs and output and also its parameters. There is little redirection and each
-rule should be able to be reasoned with on its own.
-
-
-## Input data
-
-This build starts by downloading sequences from
-https://data.nextstrain.org/files/zika/sequences.fasta.xz
-and metadata from
-https://data.nextstrain.org/files/zika/metadata.tsv.gz.
-These are publicly provisioned data by the Nextstrain team by pulling sequences
-from NCBI GenBank via ViPR and performing
-[additional bespoke curation](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
-
-Data from GenBank follows Open Data principles, such that we can make input data
-and intermediate files available for further analysis. Open Data is data that
-can be freely used, re-used and redistributed by anyone - subject only, at most,
-to the requirement to attribute and sharealike.
-
-We gratefully acknowledge the authors, originating and submitting laboratories
-of the genetic sequences and metadata for sharing their work in open databases.
-Please note that although data generators have generously shared data in an open
-fashion, that does not mean there should be free license to publish on this
-data. Data generators should be cited where possible and collaborations should
-be sought in some circumstances. Please try to avoid scooping someone else's
-work. Reach out if uncertain. Authors, paper references (where available) and
-links to GenBank entries are provided in the metadata file.
-
-A faster build process can be run working from example data by copying over
-sequences and metadata from `example_data/` to `data/` via:
-```
-mkdir -p data/
-cp -v example_data/* data/
-```
-
-[Nextstrain]: https://nextstrain.org
-[fauna]: https://github.com/nextstrain/fauna
-[augur]: https://github.com/nextstrain/augur
-[auspice]: https://github.com/nextstrain/auspice
-[snakemake cli]: https://snakemake.readthedocs.io/en/stable/executable.html#all-options
-[nextstrain-cli]: https://github.com/nextstrain/cli
-[nextstrain-cli README]: https://github.com/nextstrain/cli/blob/master/README.md
-[quickstart guide]: https://nextstrain.org/docs/getting-started/quickstart
+- [Contributor documentation](./CONTRIBUTING.md)