From 951d6b515a1b06f84d63dc0e2ee22bd13fe9d309 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 13:21:15 -0800 Subject: [PATCH 01/28] Temporarily copy ingest from mpox repo until using pathogen_template_repo Ideally, the zika ingest should be added via the pathogen_template_repo. However, to pick up recent zika sequences sooner, this temporary branch copies the ingest directory from the mpox repo: https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05 The `ingest/vendored` subdirectory is not copied over, since that folder should be added with `git subrepo`. https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05/ingest#ingestvendored Future commits will change this to work with Zika data. --- ingest/README.md | 96 ++++ ingest/Snakefile | 75 +++++ ingest/bin/fasta-to-ndjson | 86 ++++++ ingest/bin/ndjson-to-tsv-and-fasta | 67 +++++ ingest/bin/reverse_reversed_sequences.py | 29 ++ ingest/config/config.yaml | 79 +++++ ingest/config/optional.yaml | 25 ++ ingest/profiles/default/config.yaml | 4 + ingest/source-data/annotations.tsv | 276 ++++++++++++++++++ ingest/source-data/geolocation-rules.tsv | 16 + ingest/source-data/ncbi-dataset-field-map.tsv | 17 ++ ingest/source-data/nextclade-field-map.tsv | 16 + .../snakemake_rules/fetch_sequences.smk | 137 +++++++++ ingest/workflow/snakemake_rules/nextclade.smk | 86 ++++++ .../snakemake_rules/slack_notifications.smk | 55 ++++ ingest/workflow/snakemake_rules/transform.smk | 97 ++++++ .../snakemake_rules/trigger_rebuild.smk | 22 ++ ingest/workflow/snakemake_rules/upload.smk | 64 ++++ 18 files changed, 1247 insertions(+) create mode 100644 ingest/README.md create mode 100644 ingest/Snakefile create mode 100755 ingest/bin/fasta-to-ndjson create mode 100755 ingest/bin/ndjson-to-tsv-and-fasta create mode 100644 ingest/bin/reverse_reversed_sequences.py create mode 100644 ingest/config/config.yaml create mode 100644 ingest/config/optional.yaml create mode 100644
ingest/profiles/default/config.yaml create mode 100644 ingest/source-data/annotations.tsv create mode 100644 ingest/source-data/geolocation-rules.tsv create mode 100644 ingest/source-data/ncbi-dataset-field-map.tsv create mode 100644 ingest/source-data/nextclade-field-map.tsv create mode 100644 ingest/workflow/snakemake_rules/fetch_sequences.smk create mode 100644 ingest/workflow/snakemake_rules/nextclade.smk create mode 100644 ingest/workflow/snakemake_rules/slack_notifications.smk create mode 100644 ingest/workflow/snakemake_rules/transform.smk create mode 100644 ingest/workflow/snakemake_rules/trigger_rebuild.smk create mode 100644 ingest/workflow/snakemake_rules/upload.smk diff --git a/ingest/README.md b/ingest/README.md new file mode 100644 index 0000000..b7eb815 --- /dev/null +++ b/ingest/README.md @@ -0,0 +1,96 @@ +# nextstrain.org/mpox/ingest + +This is the ingest pipeline for mpox virus sequences. + +## Software requirements + +Follow the [standard installation instructions](https://docs.nextstrain.org/en/latest/install.html) for Nextstrain's suite of software tools. + +## Usage + +> NOTE: All command examples assume you are within the `ingest` directory. +> If running commands from the outer `mpox` directory, please replace the `.` with `ingest` + +Fetch sequences with + +```sh +nextstrain build . data/sequences.ndjson +``` + +Run the complete ingest pipeline with + +```sh +nextstrain build . +``` + +This will produce two files (within the `ingest` directory): + +- `results/metadata.tsv` +- `results/sequences.fasta` + +Run the complete ingest pipeline and upload results to AWS S3 with + +```sh +nextstrain build . --configfiles config/config.yaml config/optional.yaml +``` + +### Adding new sequences not from GenBank + +#### Static Files + +Do the following to include sequences from static FASTA files. + +1. 
Convert the FASTA files to NDJSON files with: + + ```sh + ./ingest/bin/fasta-to-ndjson \ + --fasta {path-to-fasta-file} \ + --fields {fasta-header-field-names} \ + --separator {field-separator-in-header} \ + --exclude {fields-to-exclude-in-output} \ + > ingest/data/{file-name}.ndjson + ``` + +2. Add the following to the `.gitignore` to allow the file to be included in the repo: + + ```gitignore + !ingest/data/{file-name}.ndjson + ``` + +3. Add the `file-name` (without the `.ndjson` extension) as a source to `ingest/config/config.yaml`. This will tell the ingest pipeline to concatenate the records to the GenBank sequences and run them through the same transform pipeline. + +## Configuration + +Configuration takes place in `config/config.yaml` by default. +Optional configs for uploading files and Slack notifications are in `config/optional.yaml`. + +### Environment Variables + +The complete ingest pipeline with AWS S3 uploads and Slack notifications uses the following environment variables: + +#### Required + +- `AWS_ACCESS_KEY_ID` +- `AWS_SECRET_ACCESS_KEY` +- `SLACK_TOKEN` +- `SLACK_CHANNELS` + +#### Optional + +These are optional environment variables used in our automated pipeline for providing detailed Slack notifications. + +- `GITHUB_RUN_ID` - provided via [`github.run_id` in a GitHub Action workflow](https://docs.github.com/en/actions/learn-github-actions/contexts#github-context) +- `AWS_BATCH_JOB_ID` - provided via [AWS Batch Job environment variables](https://docs.aws.amazon.com/batch/latest/userguide/job_env_vars.html) + +## Input data + +### GenBank data + +GenBank sequences and metadata are fetched via [NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/). + +## `ingest/vendored` + +This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of ingest scripts in [ingest/vendored](./vendored), from [nextstrain/ingest](https://github.com/nextstrain/ingest). 
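+
+For example, to pull in upstream changes from [nextstrain/ingest](https://github.com/nextstrain/ingest) (a sketch; this assumes the `git subrepo` command is installed and is run from the repository root):
+
+```sh
+git subrepo pull ingest/vendored
+```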
+ +See [vendored/README.md](vendored/README.md#vendoring) for instructions on how to update +the vendored scripts. diff --git a/ingest/Snakefile b/ingest/Snakefile new file mode 100644 index 0000000..0ed057b --- /dev/null +++ b/ingest/Snakefile @@ -0,0 +1,75 @@ +from snakemake.utils import min_version + +min_version( + "7.7.0" +) # Snakemake 7.7.0 introduced `retries` directive used in fetch-sequences + +if not config: + + configfile: "config/config.yaml" + + +send_slack_notifications = config.get("send_slack_notifications", False) + + +def _get_all_targets(wildcards): + # Default targets are the metadata TSV and sequences FASTA files + all_targets = ["results/sequences.fasta", "results/metadata.tsv"] + + # Add additional targets based on upload config + upload_config = config.get("upload", {}) + + for target, params in upload_config.items(): + files_to_upload = params.get("files_to_upload", {}) + + if not params.get("dst"): + print( + f"Skipping file upload for {target!r} because the destination was not defined." 
+ ) + else: + all_targets.extend( + expand( + [f"data/upload/{target}/{{remote_file_name}}.done"], + zip, + remote_file_name=files_to_upload.keys(), + ) + ) + + # Add additional targets for Nextstrain's internal Slack notifications + if send_slack_notifications: + all_targets.extend( + [ + "data/notify/genbank-record-change.done", + "data/notify/metadata-diff.done", + ] + ) + + if config.get("trigger_rebuild", False): + all_targets.append("data/trigger/rebuild.done") + + return all_targets + + +rule all: + input: + _get_all_targets, + + +include: "workflow/snakemake_rules/fetch_sequences.smk" +include: "workflow/snakemake_rules/transform.smk" +include: "workflow/snakemake_rules/nextclade.smk" + + +if config.get("upload", False): + + include: "workflow/snakemake_rules/upload.smk" + + +if send_slack_notifications: + + include: "workflow/snakemake_rules/slack_notifications.smk" + + +if config.get("trigger_rebuild", False): + + include: "workflow/snakemake_rules/trigger_rebuild.smk" diff --git a/ingest/bin/fasta-to-ndjson b/ingest/bin/fasta-to-ndjson new file mode 100755 index 0000000..1ee9f8f --- /dev/null +++ b/ingest/bin/fasta-to-ndjson @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +""" +Parse delimited fields from FASTA header into NDJSON format to stdout. +The output NDJSON records are guaranteed to have at least two fields: + 1. strain + 2. sequence + +Uses the `augur.io.read_sequences` function to read the FASTA file, +so `augur` must be installed in the environment running the script. +""" + +import argparse +import json +import sys + +from augur.io import read_sequences + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument("--fasta", required=True, + help="FASTA file to be transformed into NDJSON format") + parser.add_argument("--fields", nargs="+", required=True, + help="Fields in the FASTA header, listed in the same order as the header. 
" + + "These will be used as the keys in the final NDJSON output. " + + "One of the fields must be 'strain'. " + + "These cannot include the field 'sequence' as this field is reserved for the genomic sequence.") + parser.add_argument("--separator", default='|', + help="Field separator in the FASTA header") + parser.add_argument("--exclude", nargs="*", + help="List of fields to exclude from final NDJSON record. " + "These cannot include 'strain' or 'sequence'.") + + args = parser.parse_args() + + fasta_fields = [field.lower() for field in args.fields] + + exclude_fields = [] + if args.exclude: + exclude_fields = [field.lower() for field in args.exclude] + + passed_checks = True + + if 'strain' not in fasta_fields: + print("ERROR: FASTA fields must include a 'strain' field.", file=sys.stderr) + passed_checks = False + + if 'sequence' in fasta_fields: + print("ERROR: FASTA fields cannot include a 'sequence' field.", file=sys.stderr) + passed_checks = False + + if 'strain' in exclude_fields: + print("ERROR: The field 'strain' cannot be excluded from the output.", file=sys.stderr) + passed_checks = False + + if 'sequence' in exclude_fields: + print("ERROR: The field 'sequence' cannot be excluded from the output.", file=sys.stderr) + passed_checks = False + + missing_fields = [field for field in exclude_fields if field not in fasta_fields] + if missing_fields: + print(f"ERROR: The following exclude fields do not match any FASTA fields: {missing_fields}", file=sys.stderr) + passed_checks = False + + if not passed_checks: + print("ERROR: Failed to parse FASTA file into NDJSON records.","See detailed errors above.", file=sys.stderr) + sys.exit(1) + + sequences = read_sequences(args.fasta) + + for sequence in sequences: + field_values = [ + value.strip() + for value in sequence.description.split(args.separator) + ] + record = dict(zip(fasta_fields, field_values)) + record['sequence'] = str(sequence.seq).upper() + + for field in exclude_fields: + del record[field] + + 
json.dump(record, sys.stdout, allow_nan=False, indent=None, separators=(',', ':')) + print() diff --git a/ingest/bin/ndjson-to-tsv-and-fasta b/ingest/bin/ndjson-to-tsv-and-fasta new file mode 100755 index 0000000..d9d7331 --- /dev/null +++ b/ingest/bin/ndjson-to-tsv-and-fasta @@ -0,0 +1,67 @@ +#!/usr/bin/env python3 +""" +Parses NDJSON records from stdin to two different files: a metadata TSV and a +sequences FASTA. + +Records that do not have an ID or sequence will be excluded from the output files. +""" +import argparse +import csv +import json +from sys import stderr, stdin + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument("--metadata", metavar="TSV", default="data/metadata.tsv", + help="The output metadata TSV file") + parser.add_argument("--fasta", metavar="FASTA", default="data/sequences.fasta", + help="The output sequences FASTA file") + parser.add_argument("--metadata-columns", nargs="+", required=True, + help="List of fields from the NDJSON records to include as columns in the metadata TSV. 
" + + "Metadata TSV columns will be in the order of the columns provided.") + parser.add_argument("--id-field", default='strain', + help="Field from the records to use as the sequence ID in the FASTA file.") + parser.add_argument("--sequence-field", default='sequence', + help="Field from the record that holds the genomic sequence for the FASTA file.") + + args = parser.parse_args() + + with open(args.metadata, 'wt') as metadata_output: + with open(args.fasta, 'wt') as fasta_output: + metadata_csv = csv.DictWriter( + metadata_output, + args.metadata_columns, + restval="", + extrasaction='ignore', + delimiter='\t', + lineterminator='\n', + ) + metadata_csv.writeheader() + + for index, record in enumerate(stdin): + record = json.loads(record) + + sequence_id = str(record.get(args.id_field, '')) + sequence = str(record.get(args.sequence_field, '')) + + if not sequence_id: + print( + f"WARNING: Record number {index} does not have a sequence ID.", + "This record will be excluded from the output files.", + file=stderr + ) + elif not sequence: + print( + f"WARNING: Record number {index} does not have a sequence.", + "This record will be excluded from the output files.", + file=stderr + ) + else: + metadata_csv.writerow(record) + + print(f">{sequence_id}", file=fasta_output) + print(f"{sequence}" , file= fasta_output) diff --git a/ingest/bin/reverse_reversed_sequences.py b/ingest/bin/reverse_reversed_sequences.py new file mode 100644 index 0000000..6ca5ed2 --- /dev/null +++ b/ingest/bin/reverse_reversed_sequences.py @@ -0,0 +1,29 @@ +import pandas as pd +import argparse +from Bio import SeqIO + +if __name__=="__main__": + parser = argparse.ArgumentParser( + description="Reverse-complement reverse-complemented sequence", + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + + parser.add_argument('--metadata', type=str, required=True, help="input metadata") + parser.add_argument('--sequences', type=str, required=True, help="input sequences") + 
parser.add_argument('--output', type=str, required=True, help="output sequences") + args = parser.parse_args() + + metadata = pd.read_csv(args.metadata, sep='\t') + + # Read in fasta file + with open(args.sequences, 'r') as f_in: + with open(args.output, 'w') as f_out: + for seq in SeqIO.parse(f_in, 'fasta'): + # Check if metadata['reverse'] is True + if metadata.loc[metadata['accession'] == seq.id, 'reverse'].values[0] == True: + # Reverse-complement sequence + seq.seq = seq.seq.reverse_complement() + print("Reverse-complementing sequence:", seq.id) + + # Write sequences to file + SeqIO.write(seq, f_out, 'fasta') diff --git a/ingest/config/config.yaml b/ingest/config/config.yaml new file mode 100644 index 0000000..8d18c5f --- /dev/null +++ b/ingest/config/config.yaml @@ -0,0 +1,79 @@ +# Sources of sequences to include in the ingest run +sources: ['genbank'] +# Pathogen NCBI Taxonomy ID +ncbi_taxon_id: '10244' +# Renames the NCBI dataset headers +ncbi_field_map: 'source-data/ncbi-dataset-field-map.tsv' + +# Params for the transform rule +transform: + # Fields to rename. 
+ # This is the first step in the pipeline, so any references to field names + # in the configs below should use the new field names + field_map: ['collected=date', 'submitted=date_submitted', 'genbank_accession=accession', 'submitting_organization=institution'] + # Standardized strain name regex + # Currently accepts any characters because we do not have a clear standard for strain names + strain_regex: '^.+$' + # Back up strain name field if 'strain' doesn't match regex above + strain_backup_fields: ['accession'] + # List of date fields to standardize + date_fields: ['date', 'date_submitted'] + # Expected date formats present in date fields + # These date formats should use directives expected by datetime + # See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes + expected_date_formats: ['%Y', '%Y-%m', '%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ'] + # Titlecase rules + titlecase: + # Abbreviations not cast to titlecase, keeps uppercase + abbreviations: ['USA'] + # Articles that should not be cast to titlecase + articles: [ + 'and', 'd', 'de', 'del', 'des', 'di', 'do', 'en', 'l', 'la', 'las', 'le', + 'los', 'nad', 'of', 'op', 'sur', 'the', 'y' + ] + # List of string fields to titlecase + fields: ['region', 'country', 'division', 'location'] + # Authors field name + authors_field: 'authors' + # Authors default value if authors value is empty + authors_default_value: '?' 
+ # Field name for the generated abbreviated authors + abbr_authors_field: 'abbr_authors' + # General geolocation rules to apply to geolocation fields + geolocation_rules_url: 'https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv' + # Local geolocation rules that are only applicable to mpox data + # Local rules can overwrite the general geolocation rules provided above + local_geolocation_rules: 'source-data/geolocation-rules.tsv' + # User annotations file + annotations: 'source-data/annotations.tsv' + # ID field used to merge annotations + annotations_id: 'accession' + # Field to use as the sequence ID in the FASTA file + id_field: 'accession' + # Field to use as the sequence in the FASTA file + sequence_field: 'sequence' + # Final output columns for the metadata TSV + metadata_columns: [ + 'accession', + 'genbank_accession_rev', + 'strain', + 'date', + 'region', + 'country', + 'division', + 'location', + 'host', + 'date_submitted', + 'sra_accession', + 'abbr_authors', + 'reverse', + 'authors', + 'institution' + ] + +# Params for Nextclade related rules +nextclade: + # Field to use as the sequence ID in the Nextclade file + id_field: 'seqName' + # Fields from a Nextclade file to be renamed (if desired) and appended to a metadata file + field_map: 'source-data/nextclade-field-map.tsv' diff --git a/ingest/config/optional.yaml b/ingest/config/optional.yaml new file mode 100644 index 0000000..d445e07 --- /dev/null +++ b/ingest/config/optional.yaml @@ -0,0 +1,25 @@ +# Optional configs used by Nextstrain team +# Params for uploads +upload: + # Upload params for AWS S3 + s3: + # AWS S3 Bucket with prefix + dst: 's3://nextstrain-data/files/workflows/mpox' + # Mapping of files to upload, with key as remote file name and the value + # the local file path relative to the ingest directory. 
+ files_to_upload: + genbank.ndjson.xz: data/genbank.ndjson + all_sequences.ndjson.xz: data/sequences.ndjson + metadata.tsv.gz: results/metadata.tsv + sequences.fasta.xz: results/sequences.fasta + alignment.fasta.xz: data/alignment.fasta + insertions.csv.gz: data/insertions.csv + translations.zip: data/translations.zip + + cloudfront_domain: 'data.nextstrain.org' + +# Toggle for Slack notifications +send_slack_notifications: True + +# Toggle for triggering builds +trigger_rebuild: True diff --git a/ingest/profiles/default/config.yaml b/ingest/profiles/default/config.yaml new file mode 100644 index 0000000..c69390b --- /dev/null +++ b/ingest/profiles/default/config.yaml @@ -0,0 +1,4 @@ +cores: all +rerun-incomplete: true +printshellcmds: true +reason: true diff --git a/ingest/source-data/annotations.tsv b/ingest/source-data/annotations.tsv new file mode 100644 index 0000000..e3c3ec1 --- /dev/null +++ b/ingest/source-data/annotations.tsv @@ -0,0 +1,276 @@ +AF380138 country Democratic Republic of the Congo +AY741551 country Sierra Leone +DQ011153 country USA +DQ011154 country Republic of the Congo +DQ011155 country Democratic Republic of the Congo +DQ011156 country Liberia +DQ011157 country USA +NC_003310 country Democratic Republic of the Congo +OR473631 country France +AF380138 region Africa +AY741551 region Africa +AY753185 region Africa +DQ011153 region North America +DQ011154 region Africa +DQ011155 region Africa +DQ011156 region Africa +DQ011157 region North America +NC_003310 region Africa +OR473631 region Europe +AY603973 date 1961-XX-XX +AY741551 date 1970-XX-XX +AY753185 date 1958-XX-XX +DQ011153 date 2003-XX-XX +DQ011156 date 1970-XX-XX +DQ011157 date 2003-XX-XX +LC722946 date 2022-07-XX +MG693723 date 2017-XX-XX +AF380138 date 1996-XX-XX +DQ011154 date 2003-XX-XX +DQ011155 date 1978-XX-XX +HM172544 date 1979-XX-XX +HQ857562 date 1979-XX-XX +MT903337 date 2018-XX-XX +MT903338 date 2018-XX-XX +MT903339 date 2018-XX-XX +MT903340 date 2018-XX-XX +MT903341 date 
2018-08-14 +MT903342 date 2019-04-30 +MT903342 date 2019-05-XX +MT903343 date 2018-09-XX +MT903344 date 2018-09-XX +MT903345 date 2018-09-XX +MT903346 date 2003-XX-XX +MT903347 date 2003-XX-XX +MT903348 date 2003-XX-XX +ON782021 date 2022-05-24 +ON782022 date 2022-05-31 +ON880529 date 2022-05-28 +ON880533 date 2022-05-30 +ON880534 date 2022-05-30 +OP536786 date 2022-XX-XX +OX044336 date 2022-XX-XX +OX044337 date 2022-XX-XX +OX044338 date 2022-XX-XX +OX044339 date 2022-XX-XX +OX044340 date 2022-XX-XX +OX044341 date 2022-XX-XX +OX044342 date 2022-XX-XX +OX044343 date 2022-XX-XX +OX044344 date 2022-XX-XX +OX044345 date 2022-XX-XX +OX044346 date 2022-XX-XX +OX044347 date 2022-XX-XX +OX044348 date 2022-XX-XX +OX344864 date 2022-XX-XX +OX344865 date 2022-XX-XX +OX344866 date 2022-XX-XX +OX344867 date 2022-XX-XX +OX344868 date 2022-XX-XX +OX344869 date 2022-XX-XX +OX344870 date 2022-XX-XX +OX344871 date 2022-XX-XX +OX344872 date 2022-XX-XX +OX344873 date 2022-XX-XX +OX344874 date 2022-XX-XX +OX344875 date 2022-XX-XX +OX344876 date 2022-XX-XX +OX344877 date 2022-XX-XX +OX344878 date 2022-XX-XX +OX344879 date 2022-XX-XX +OX344880 date 2022-XX-XX +OX344881 date 2022-XX-XX +OX344882 date 2022-XX-XX +OX344883 date 2022-XX-XX +OX344884 date 2022-XX-XX +OX344885 date 2022-XX-XX +OX344886 date 2022-XX-XX +OX344887 date 2022-XX-XX +OX344888 date 2022-XX-XX +OX344889 date 2022-XX-XX +OX344890 date 2022-XX-XX +AF380138 strain Zaire-96-I-16 +DQ011154 strain Congo_2003_358 +DQ011155 strain Zaire_1979-005 +HM172544 strain Zaire 1979-005 +HQ857562 strain V79-I-005 +MT903338 strain MPXV-M2957_Lagos +AY603973 institution Biochemistry & Microbiology, University of Victoria, Canada +AY741551 institution Biochemistry & Microbiology, University of Victoria, Canada +AY753185 institution Biochemistry & Microbiology, University of Victoria, Canada +DQ011153 institution National Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA +DQ011154 institution National 
Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA +DQ011156 institution National Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA +DQ011157 institution National Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA +FV537349 institution Genetics Signatures, Sydney, Australia +FV537350 institution Genetics Signatures, Sydney, Australia +FV537351 institution Genetics Signatures, Sydney, Australia +FV537352 institution Genetics Signatures, Sydney, Australia +HM172544 institution Virology, The United States Army Medical Research Institute for Infectious Diseases (USAMRIID), USA +HQ857562 institution Vaccine and Gene Therapy Institute, Oregon Health and Science University, USA +HQ857563 institution Vaccine and Gene Therapy Institute, Oregon Health and Science University, USA +JX878407 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878408 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878409 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878410 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878411 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878412 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878413 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878414 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878415 institution 
Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878416 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878417 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878418 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878419 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878420 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878421 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878422 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878423 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878424 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878425 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878426 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878427 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878428 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA +JX878429 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), 
USA +KC257459 institution Biochemistry & Microbiology, University of Victoria, Canada +KC257460 institution Biochemistry & Microbiology, University of Victoria, Canada +KJ136820 institution Centre for Biological Security, Robert Koch Institute, Germany +KJ642612 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KJ642613 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KJ642614 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KJ642615 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KJ642616 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KJ642617 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KJ642618 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KJ642619 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KP849469 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KP849470 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +KP849471 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA +MK783028 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA +MK783029 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA +MK783030 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA +MK783031 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA +MK783032 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA +MN346690 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346692 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, 
Robert Koch Institute, Germany +MN346693 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346694 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346695 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346696 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346698 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346699 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346700 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN346702 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany +MN648051 institution Biochemistry & Molecular Genetics, Israel Institute for Biological Research, Israel +MN702444 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702445 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702446 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702447 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702448 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702449 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702450 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702451 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MN702452 institution Virology, Centre International de Recherches Medicales 
de Franceville, CIRMF, Gabon +MN702453 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon +MT724769 institution Biology, Universiteit Antwerpen, Belgium +MT724770 institution Biology, Universiteit Antwerpen, Belgium +MT724771 institution Biology, Universiteit Antwerpen, Belgium +MT724772 institution Biology, Universiteit Antwerpen, Belgium +MT903337 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903338 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903339 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903340 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903341 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903342 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903343 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903344 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903345 institution National Center for Emerging and Zoonotic Infectious 
Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903346 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903347 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +MT903348 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +NC_003310 institution Department of Molecular Biology of Genomes, SRC VB Vector, Russia +ON563414 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON568298 institution Microbiol Genomics & Bioinformatics, Bundeswehr Institute of Microbiology, Germany +ON585029 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585030 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585031 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585032 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585033 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585034 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585035 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585036 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585037 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON585038 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON595760 institution Laboratory of Virology, University Hospital of Geneva, Switzerland +ON602722
institution IHAP, VIRAL, Universite de Toulouse, INRAE, ENVT, France +ON609725 institution Laboratory for Diagnostics of Zoonoses & WHO Centre, Institute of Microbiology & Immunology, Faculty of Medicine, University of Ljubljana, Slovenia +ON614676 institution Laboratory of Virology, INMI Lazzaro Spallanzani IRCCS, Italy +ON615424 institution Public Health Virology, Erasmus Medical Centre, The Netherlands +ON619835 institution Research & Evaluation, UKHSA, UK +ON619836 institution Research & Evaluation, UKHSA, UK +ON619837 institution Research & Evaluation, UKHSA, UK +ON619838 institution Research & Evaluation, UKHSA, UK +ON622712 institution Microbiology, Immunology & Transplantation, KU Leuven, Rega Institute, Belgium +ON622713 institution Microbiology, Immunology & Transplantation, KU Leuven, Rega Institute, Belgium +ON622718 institution Microbiology, Hospital Universitari Germans Trias i Pujol, Spain +ON622720 institution Laboratory of Virology, University Hospital of Geneva, Switzerland +ON622721 institution Department of Biomedical & Clinical Sciences, University of Milan, Italy +ON622722 institution Virology, GENomique EPIdemiologique des maladies Infectieuses, France +ON627808 institution Department of Health, Utah Public Health Laboratory, USA +ON631241 institution Laboratory for Diagnostics of Zoonoses & WHO Centre, Institute of Microbiology & Immunology, Faculty of Medicine, University of Ljubljana, Slovenia +ON631963 institution Victorian Infectious Diseases Reference Laboratory, Doherty Institute, Australia +ON637938 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON637939 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON644344 institution Genomics & Epigenomics, AREA Science Park, Italy +ON645312 institution Centre for Clinical Infection & Diagnostics Research, Kings College London, St Thomas Hospital, UK +ON649708 institution Institute
National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649709 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649710 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649711 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649712 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649713 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649714 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649715 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649716 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649717 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649718 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649719 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649720 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649721 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649722 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649723 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649724 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649725 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal +ON649879 institution Biochemistry & Molecular Genetics, Israel Institute for Biological Research, Israel +ON674051 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON675438 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON676703 institution Division of High-Consequence Pathogens & Pathology, Centers 
for Disease Control & Prevention (US CDC), USA +ON676704 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON676705 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON676706 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON676707 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON676708 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA +ON682263 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON682264 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON682265 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON682266 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON682267 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON682268 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON682269 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON682270 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694329 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694330 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694331 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694332 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert 
Koch Institute, Germany +ON694333 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694334 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694335 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694336 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694337 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694338 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694339 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694340 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694341 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON694342 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany +ON720848 institution Microbial Genomics, Hospital General Universitario Gregorio Marañón, Madrid, Spain +ON720849 institution Microbial Genomics, Hospital General Universitario Gregorio Marañón, Madrid, Spain diff --git a/ingest/source-data/geolocation-rules.tsv b/ingest/source-data/geolocation-rules.tsv new file mode 100644 index 0000000..2fc03f8 --- /dev/null +++ b/ingest/source-data/geolocation-rules.tsv @@ -0,0 +1,16 @@ +Africa/Cote d'Ivoire/*/* Africa/Côte d'Ivoire/*/* +Africa/Cote d'Ivoire/Tai National Park/* Africa/Côte d'Ivoire/Bas-Sassandra/Tai National Park +Africa/Democratic Republic of the Congo/Province Bandundu/* Africa/Democratic Republic of the Congo/Bandundu/* +Africa/Democratic Republic of the Congo/Province Equateur/* Africa/Democratic Republic of the Congo/Équateur/* +Africa/Democratic Republic of the Congo/Province Kasai 
Occidental/* Africa/Democratic Republic of the Congo/Kasaï-Occidental/* +Africa/Democratic Republic of the Congo/Province Kasai Oriental/* Africa/Democratic Republic of the Congo/Kasaï-Oriental/* +Africa/Democratic Republic of the Congo/Province P. Oriental/* Africa/Democratic Republic of the Congo/Orientale/* +Africa/Democratic Republic of the Congo/Yangdongi/ Africa/Democratic Republic of the Congo/Mongala/Yangdongi +Africa/Democratic Republic of the Congo/Zaire/* Africa/Democratic Republic of the Congo// +Africa/Zaire/*/* Africa/Democratic Republic of the Congo// +*/Zaire/*/* Africa/Democratic Republic of the Congo// +Europe/France/Paris/* Europe/France/Ile de France/Paris FR +Europe/Italy/Fvg/Gorizia Europe/Italy/Friuli Venezia Giulia/Gorizia +# Unclear which location is the real location +Europe/Netherlands/Utrecht/Rotterdam Europe/Netherlands// +North America/USA/Washington/Dc North America/USA/Washington DC/ diff --git a/ingest/source-data/ncbi-dataset-field-map.tsv b/ingest/source-data/ncbi-dataset-field-map.tsv new file mode 100644 index 0000000..eb79418 --- /dev/null +++ b/ingest/source-data/ncbi-dataset-field-map.tsv @@ -0,0 +1,17 @@ +key value +Accession genbank_accession_rev +Source database database +Isolate Lineage strain +Geographic Region region +Geographic Location location +Isolate Collection date collected +Release date submitted +Update date updated +Length length +Host Name host +Isolate Lineage source isolation_source +BioProjects bioproject_accession +BioSample accession biosample_accession +SRA Accessions sra_accession +Submitter Names authors +Submitter Affiliation submitting_organization diff --git a/ingest/source-data/nextclade-field-map.tsv b/ingest/source-data/nextclade-field-map.tsv new file mode 100644 index 0000000..a495da3 --- /dev/null +++ b/ingest/source-data/nextclade-field-map.tsv @@ -0,0 +1,16 @@ +key value +seqName seqName +clade clade +outbreak outbreak +lineage lineage +coverage coverage +totalMissing missing_data 
+totalSubstitutions divergence +totalNonACGTNs nonACGTN +qc.missingData.status QC_missing_data +qc.mixedSites.status QC_mixed_sites +qc.privateMutations.status QC_rare_mutations +qc.frameShifts.status QC_frame_shifts +qc.stopCodons.status QC_stop_codons +frameShifts frame_shifts +isReverseComplement is_reverse_complement \ No newline at end of file diff --git a/ingest/workflow/snakemake_rules/fetch_sequences.smk b/ingest/workflow/snakemake_rules/fetch_sequences.smk new file mode 100644 index 0000000..3f32f9b --- /dev/null +++ b/ingest/workflow/snakemake_rules/fetch_sequences.smk @@ -0,0 +1,137 @@ +""" +This part of the workflow handles fetching sequences from various sources. +Uses `config.sources` to determine which sequences to include in final output. + +Currently only fetches sequences from GenBank, but other sources can be +defined in the config. If adding other sources, add a new rule upstream +of rule `fetch_all_sequences` that creates the file `data/{source}.ndjson`, or commit the +file as a static file in the repo.
+ + Produces final output as + + sequences_ndjson = "data/sequences.ndjson" + +""" + + +rule fetch_ncbi_dataset_package: + output: + dataset_package=temp("data/ncbi_dataset.zip"), + retries: 5 # Requires snakemake 7.7.0 or later + benchmark: + "benchmarks/fetch_ncbi_dataset_package.txt" + params: + ncbi_taxon_id=config["ncbi_taxon_id"], + shell: + """ + datasets download virus genome taxon {params.ncbi_taxon_id} \ + --no-progressbar \ + --filename {output.dataset_package} + """ + + +rule extract_ncbi_dataset_sequences: + input: + dataset_package="data/ncbi_dataset.zip", + output: + ncbi_dataset_sequences=temp("data/ncbi_dataset_sequences.fasta"), + benchmark: + "benchmarks/extract_ncbi_dataset_sequences.txt" + shell: + """ + unzip -jp {input.dataset_package} \ + ncbi_dataset/data/genomic.fna > {output.ncbi_dataset_sequences} + """ + + +def _get_ncbi_dataset_field_mnemonics(wildcards) -> str: + """ + Return a comma-separated string of NCBI Dataset report field mnemonics for fields + that we want to parse out of the dataset report. The column names in the output TSV + are different from the mnemonics.
+ + See NCBI Dataset docs for full list of available fields and their column + names in the output: + https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields + """ + fields = [ + "accession", + "sourcedb", + "isolate-lineage", + "geo-region", + "geo-location", + "isolate-collection-date", + "release-date", + "update-date", + "length", + "host-name", + "isolate-lineage-source", + "bioprojects", + "biosample-acc", + "sra-accs", + "submitter-names", + "submitter-affiliation", + ] + return ",".join(fields) + + +rule format_ncbi_dataset_report: + # Formats the headers to be the same as before we used NCBI Datasets + # The only fields we do not have equivalents for are "title" and "publications" + input: + dataset_package="data/ncbi_dataset.zip", + ncbi_field_map=config["ncbi_field_map"], + output: + ncbi_dataset_tsv=temp("data/ncbi_dataset_report.tsv"), + params: + fields_to_include=_get_ncbi_dataset_field_mnemonics, + benchmark: + "benchmarks/format_ncbi_dataset_report.txt" + shell: + """ + dataformat tsv virus-genome \ + --package {input.dataset_package} \ + --fields {params.fields_to_include:q} \ + | csvtk -tl rename2 -F -f '*' -p '(.+)' -r '{{kv}}' -k {input.ncbi_field_map} \ + | csvtk -tl mutate -f genbank_accession_rev -n genbank_accession -p "^(.+?)\." 
\ + | tsv-select -H -f genbank_accession --rest last \ + > {output.ncbi_dataset_tsv} + """ + + +rule format_ncbi_datasets_ndjson: + input: + ncbi_dataset_sequences="data/ncbi_dataset_sequences.fasta", + ncbi_dataset_tsv="data/ncbi_dataset_report.tsv", + output: + ndjson="data/genbank.ndjson", + log: + "logs/format_ncbi_datasets_ndjson.txt", + benchmark: + "benchmarks/format_ncbi_datasets_ndjson.txt" + shell: + """ + augur curate passthru \ + --metadata {input.ncbi_dataset_tsv} \ + --fasta {input.ncbi_dataset_sequences} \ + --seq-id-column genbank_accession_rev \ + --seq-field sequence \ + --unmatched-reporting warn \ + --duplicate-reporting warn \ + 2> {log} > {output.ndjson} + """ + + +def _get_all_sources(wildcards): + return [f"data/{source}.ndjson" for source in config["sources"]] + + +rule fetch_all_sequences: + input: + all_sources=_get_all_sources, + output: + sequences_ndjson="data/sequences.ndjson", + shell: + """ + cat {input.all_sources} > {output.sequences_ndjson} + """ diff --git a/ingest/workflow/snakemake_rules/nextclade.smk b/ingest/workflow/snakemake_rules/nextclade.smk new file mode 100644 index 0000000..f10a3f9 --- /dev/null +++ b/ingest/workflow/snakemake_rules/nextclade.smk @@ -0,0 +1,86 @@ + +rule nextclade_dataset: + output: + temp("mpxv.zip"), + shell: + """ + nextclade dataset get --name MPXV --output-zip {output} + """ + + +rule nextclade_dataset_hMPXV: + output: + temp("hmpxv.zip"), + shell: + """ + nextclade dataset get --name hMPXV --output-zip {output} + """ + + +rule align: + input: + sequences="results/sequences.fasta", + dataset="hmpxv.zip", + output: + alignment="data/alignment.fasta", + insertions="data/insertions.csv", + translations="data/translations.zip", + params: + # The lambda is used to deactivate automatic wildcard expansion. 
+ # https://github.com/snakemake/snakemake/blob/384d0066c512b0429719085f2cf886fdb97fd80a/snakemake/rules.py#L997-L1000 + translations=lambda w: "data/translations/{gene}.fasta", + threads: 4 + shell: + """ + nextclade run -D {input.dataset} -j {threads} --retry-reverse-complement \ + --output-fasta {output.alignment} --output-translations {params.translations} \ + --output-insertions {output.insertions} {input.sequences} + zip -rj {output.translations} data/translations + """ + + +rule nextclade: + input: + sequences="results/sequences.fasta", + dataset="mpxv.zip", + output: + "data/nextclade.tsv", + threads: 4 + shell: + """ + nextclade run -D {input.dataset} -j {threads} --output-tsv {output} {input.sequences} --retry-reverse-complement + """ + + +rule join_metadata_clades: + input: + nextclade="data/nextclade.tsv", + metadata="data/metadata_raw.tsv", + nextclade_field_map=config["nextclade"]["field_map"], + output: + metadata="results/metadata.tsv", + params: + id_field=config["transform"]["id_field"], + nextclade_id_field=config["nextclade"]["id_field"], + shell: + """ + export SUBSET_FIELDS=`awk 'NR>1 {{print $1}}' {input.nextclade_field_map} | tr '\n' ',' | sed 's/,$//g'` + + csvtk -tl cut -f $SUBSET_FIELDS \ + {input.nextclade} \ + | csvtk -tl rename2 \ + -F \ + -f '*' \ + -p '(.+)' \ + -r '{{kv}}' \ + -k {input.nextclade_field_map} \ + | tsv-join -H \ + --filter-file - \ + --key-fields {params.nextclade_id_field} \ + --data-fields {params.id_field} \ + --append-fields '*' \ + --write-all ? \ + {input.metadata} \ + | tsv-select -H --exclude {params.nextclade_id_field} \ + > {output.metadata} + """ diff --git a/ingest/workflow/snakemake_rules/slack_notifications.smk b/ingest/workflow/snakemake_rules/slack_notifications.smk new file mode 100644 index 0000000..9eb0463 --- /dev/null +++ b/ingest/workflow/snakemake_rules/slack_notifications.smk @@ -0,0 +1,55 @@ +""" +This part of the workflow handles various Slack notifications. 
+Designed to be used internally by the Nextstrain team with hard-coded paths +to files on AWS S3. + +All rules here require two environment variables: + * SLACK_TOKEN + * SLACK_CHANNELS +""" +import os +import sys + +slack_envvars_defined = "SLACK_CHANNELS" in os.environ and "SLACK_TOKEN" in os.environ +if not slack_envvars_defined: + print( + "ERROR: Slack notifications require two environment variables: 'SLACK_CHANNELS' and 'SLACK_TOKEN'.", + file=sys.stderr, + ) + sys.exit(1) + +S3_SRC = "s3://nextstrain-data/files/workflows/mpox" + + +rule notify_on_genbank_record_change: + input: + genbank_ndjson="data/genbank.ndjson", + output: + touch("data/notify/genbank-record-change.done"), + params: + s3_src=S3_SRC, + shell: + """ + ./vendored/notify-on-record-change {input.genbank_ndjson} {params.s3_src:q}/genbank.ndjson.xz Genbank + """ + + +rule notify_on_metadata_diff: + input: + metadata="results/metadata.tsv", + output: + touch("data/notify/metadata-diff.done"), + params: + s3_src=S3_SRC, + shell: + """ + ./vendored/notify-on-diff {input.metadata} {params.s3_src:q}/metadata.tsv.gz + """ + + +onstart: + shell("./vendored/notify-on-job-start Ingest nextstrain/mpox") + + +onerror: + shell("./vendored/notify-on-job-fail Ingest nextstrain/mpox") diff --git a/ingest/workflow/snakemake_rules/transform.smk b/ingest/workflow/snakemake_rules/transform.smk new file mode 100644 index 0000000..fe7d7c1 --- /dev/null +++ b/ingest/workflow/snakemake_rules/transform.smk @@ -0,0 +1,97 @@ +""" +This part of the workflow handles transforming the data into standardized +formats and expects input file + + sequences_ndjson = "data/sequences.ndjson" + +This will produce output files as + + metadata = "data/metadata_raw.tsv" + sequences = "results/sequences.fasta" + +Parameters are expected to be defined in `config.transform`. 
+""" + + +rule fetch_general_geolocation_rules: + output: + general_geolocation_rules="data/general-geolocation-rules.tsv", + params: + geolocation_rules_url=config["transform"]["geolocation_rules_url"], + shell: + """ + curl {params.geolocation_rules_url} > {output.general_geolocation_rules} + """ + + +rule concat_geolocation_rules: + input: + general_geolocation_rules="data/general-geolocation-rules.tsv", + local_geolocation_rules=config["transform"]["local_geolocation_rules"], + output: + all_geolocation_rules="data/all-geolocation-rules.tsv", + shell: + """ + cat {input.general_geolocation_rules} {input.local_geolocation_rules} >> {output.all_geolocation_rules} + """ + + +rule transform: + input: + sequences_ndjson="data/sequences.ndjson", + all_geolocation_rules="data/all-geolocation-rules.tsv", + annotations=config["transform"]["annotations"], + output: + metadata="data/metadata_raw.tsv", + sequences="results/sequences.fasta", + log: + "logs/transform.txt", + params: + field_map=config["transform"]["field_map"], + strain_regex=config["transform"]["strain_regex"], + strain_backup_fields=config["transform"]["strain_backup_fields"], + date_fields=config["transform"]["date_fields"], + expected_date_formats=config["transform"]["expected_date_formats"], + articles=config["transform"]["titlecase"]["articles"], + abbreviations=config["transform"]["titlecase"]["abbreviations"], + titlecase_fields=config["transform"]["titlecase"]["fields"], + authors_field=config["transform"]["authors_field"], + authors_default_value=config["transform"]["authors_default_value"], + abbr_authors_field=config["transform"]["abbr_authors_field"], + annotations_id=config["transform"]["annotations_id"], + metadata_columns=config["transform"]["metadata_columns"], + id_field=config["transform"]["id_field"], + sequence_field=config["transform"]["sequence_field"], + shell: + """ + (cat {input.sequences_ndjson} \ + | ./vendored/transform-field-names \ + --field-map {params.field_map} \ + | augur 
curate normalize-strings \ + | ./vendored/transform-strain-names \ + --strain-regex {params.strain_regex} \ + --backup-fields {params.strain_backup_fields} \ + | augur curate format-dates \ + --date-fields {params.date_fields} \ + --expected-date-formats {params.expected_date_formats} \ + | ./vendored/transform-genbank-location \ + | augur curate titlecase \ + --titlecase-fields {params.titlecase_fields} \ + --articles {params.articles} \ + --abbreviations {params.abbreviations} \ + | ./vendored/transform-authors \ + --authors-field {params.authors_field} \ + --default-value {params.authors_default_value} \ + --abbr-authors-field {params.abbr_authors_field} \ + | ./vendored/apply-geolocation-rules \ + --geolocation-rules {input.all_geolocation_rules} \ + | ./vendored/merge-user-metadata \ + --annotations {input.annotations} \ + --id-field {params.annotations_id} \ + | ./bin/ndjson-to-tsv-and-fasta \ + --metadata-columns {params.metadata_columns} \ + --metadata {output.metadata} \ + --fasta {output.sequences} \ + --id-field {params.id_field} \ + --sequence-field {params.sequence_field} ) 2>> {log} + """ diff --git a/ingest/workflow/snakemake_rules/trigger_rebuild.smk b/ingest/workflow/snakemake_rules/trigger_rebuild.smk new file mode 100644 index 0000000..2e797ee --- /dev/null +++ b/ingest/workflow/snakemake_rules/trigger_rebuild.smk @@ -0,0 +1,22 @@ +""" +This part of the workflow handles triggering new mpox builds after the +latest metadata TSV and sequence FASTA files have been uploaded to S3. + +Designed to be used internally by the Nextstrain team with hard-coded paths +to expected upload flag files. +""" + + +rule trigger_build: + """ + Trigger monkeypox builds via repository action type `rebuild`.
+ """ + input: + metadata_upload="data/upload/s3/metadata.tsv.gz.done", + fasta_upload="data/upload/s3/sequences.fasta.xz.done", + output: + touch("data/trigger/rebuild.done"), + shell: + """ + ./vendored/trigger-on-new-data nextstrain/mpox rebuild {input.metadata_upload} {input.fasta_upload} + """ diff --git a/ingest/workflow/snakemake_rules/upload.smk b/ingest/workflow/snakemake_rules/upload.smk new file mode 100644 index 0000000..60c5c9b --- /dev/null +++ b/ingest/workflow/snakemake_rules/upload.smk @@ -0,0 +1,64 @@ +""" +This part of the workflow handles uploading files to a specified destination. + +Uses the predefined wildcard `file_to_upload` to determine the input, and the predefined +wildcard `remote_file_name` as the remote file name in the specified destination. + +Produces output files as `data/upload/{upload_target_name}/{remote_file_name}.done`. + +Currently only supports uploads to AWS S3, but additional upload rules can +be easily added as long as they follow the output pattern described above. +""" +import os + +slack_envvars_defined = "SLACK_CHANNELS" in os.environ and "SLACK_TOKEN" in os.environ +send_notifications = ( + config.get("send_slack_notifications", False) and slack_envvars_defined +) + + +def _get_upload_inputs(wildcards): + """ + If the file_to_upload has Slack notifications that depend on diffs with S3 files, + then we want the upload rule to run after the notification rule. + + This function is mostly to keep track of which flag files to expect for + the rules in `slack_notifications.smk`, so it only includes flag files if + `send_notifications` is True.
+ """ + inputs = { + "file_to_upload": config["upload"]["s3"]["files_to_upload"][ + wildcards.remote_file_name + ], + } + + if send_notifications: + flag_file = [] + + if inputs["file_to_upload"] == "data/genbank.ndjson": + flag_file = "data/notify/genbank-record-change.done" + elif inputs["file_to_upload"] == "results/metadata.tsv": + flag_file = "data/notify/metadata-diff.done" + + inputs["notify_flag_file"] = flag_file + + return inputs + + +rule upload_to_s3: + input: + unpack(_get_upload_inputs), + output: + "data/upload/s3/{remote_file_name}.done", + params: + quiet="" if send_notifications else "--quiet", + s3_dst=config["upload"].get("s3", {}).get("dst", ""), + cloudfront_domain=config["upload"].get("s3", {}).get("cloudfront_domain", ""), + shell: + """ + ./vendored/upload-to-s3 \ + {params.quiet} \ + {input.file_to_upload:q} \ + {params.s3_dst:q}/{wildcards.remote_file_name:q} \ + {params.cloudfront_domain} 2>&1 | tee {output} + """ From 0496748d63303aa310c86f91fc9ca58e5c79c7af Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 13:26:33 -0800 Subject: [PATCH 02/28] git subrepo clone (merge) https://github.com/nextstrain/ingest ingest/vendored subrepo: subdir: "ingest/vendored" merged: "a0faef5" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "a0faef5" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb" --- ingest/vendored/.cramrc | 3 + .../vendored/.github/pull_request_template.md | 16 ++ ingest/vendored/.github/workflows/ci.yaml | 23 ++ ingest/vendored/.gitrepo | 12 + ingest/vendored/.shellcheckrc | 6 + ingest/vendored/README.md | 140 +++++++++++ ingest/vendored/apply-geolocation-rules | 234 ++++++++++++++++++ ingest/vendored/cloudfront-invalidate | 42 ++++ ingest/vendored/download-from-s3 | 48 ++++ ingest/vendored/fetch-from-ncbi-entrez | 70 ++++++ ingest/vendored/merge-user-metadata | 55 ++++ ingest/vendored/notify-on-diff | 35 +++ ingest/vendored/notify-on-job-fail | 23 ++
ingest/vendored/notify-on-job-start | 27 ++ ingest/vendored/notify-on-record-change | 53 ++++ ingest/vendored/notify-slack | 56 +++++ ingest/vendored/s3-object-exists | 8 + ingest/vendored/sha256sum | 15 ++ .../transform-strain-names.t | 17 ++ ingest/vendored/transform-authors | 66 +++++ ingest/vendored/transform-field-names | 48 ++++ ingest/vendored/transform-genbank-location | 43 ++++ ingest/vendored/transform-strain-names | 50 ++++ ingest/vendored/trigger | 56 +++++ ingest/vendored/trigger-on-new-data | 32 +++ ingest/vendored/upload-to-s3 | 78 ++++++ 26 files changed, 1256 insertions(+) create mode 100644 ingest/vendored/.cramrc create mode 100644 ingest/vendored/.github/pull_request_template.md create mode 100644 ingest/vendored/.github/workflows/ci.yaml create mode 100644 ingest/vendored/.gitrepo create mode 100644 ingest/vendored/.shellcheckrc create mode 100644 ingest/vendored/README.md create mode 100755 ingest/vendored/apply-geolocation-rules create mode 100755 ingest/vendored/cloudfront-invalidate create mode 100755 ingest/vendored/download-from-s3 create mode 100755 ingest/vendored/fetch-from-ncbi-entrez create mode 100755 ingest/vendored/merge-user-metadata create mode 100755 ingest/vendored/notify-on-diff create mode 100755 ingest/vendored/notify-on-job-fail create mode 100755 ingest/vendored/notify-on-job-start create mode 100755 ingest/vendored/notify-on-record-change create mode 100755 ingest/vendored/notify-slack create mode 100755 ingest/vendored/s3-object-exists create mode 100755 ingest/vendored/sha256sum create mode 100644 ingest/vendored/tests/transform-strain-names/transform-strain-names.t create mode 100755 ingest/vendored/transform-authors create mode 100755 ingest/vendored/transform-field-names create mode 100755 ingest/vendored/transform-genbank-location create mode 100755 ingest/vendored/transform-strain-names create mode 100755 ingest/vendored/trigger create mode 100755 ingest/vendored/trigger-on-new-data create mode 100755 
ingest/vendored/upload-to-s3 diff --git a/ingest/vendored/.cramrc b/ingest/vendored/.cramrc new file mode 100644 index 0000000..153d20f --- /dev/null +++ b/ingest/vendored/.cramrc @@ -0,0 +1,3 @@ +[cram] +shell = /bin/bash +indent = 2 diff --git a/ingest/vendored/.github/pull_request_template.md b/ingest/vendored/.github/pull_request_template.md new file mode 100644 index 0000000..ed4a5b2 --- /dev/null +++ b/ingest/vendored/.github/pull_request_template.md @@ -0,0 +1,16 @@ +### Description of proposed changes + + + +### Related issue(s) + + + +### Checklist + + + +- [ ] Checks pass +- [ ] If adding a script, add an entry for it in the README. + + diff --git a/ingest/vendored/.github/workflows/ci.yaml b/ingest/vendored/.github/workflows/ci.yaml new file mode 100644 index 0000000..c6a218a --- /dev/null +++ b/ingest/vendored/.github/workflows/ci.yaml @@ -0,0 +1,23 @@ +name: CI + +on: + push: + branches: + - main + pull_request: + workflow_dispatch: + +jobs: + shellcheck: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - uses: nextstrain/.github/actions/shellcheck@master + + cram: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - uses: actions/setup-python@v4 + - run: pip install cram + - run: cram tests/ \ No newline at end of file diff --git a/ingest/vendored/.gitrepo b/ingest/vendored/.gitrepo new file mode 100644 index 0000000..40287c2 --- /dev/null +++ b/ingest/vendored/.gitrepo @@ -0,0 +1,12 @@ +; DO NOT EDIT (unless you know what you are doing) +; +; This subdirectory is a git "subrepo", and this file is maintained by the +; git-subrepo command. 
See https://github.com/ingydotnet/git-subrepo#readme +; +[subrepo] + remote = https://github.com/nextstrain/ingest + branch = main + commit = a0faef53a0c6e7cc4057209454ef0852875dc3a9 + parent = 951d6b515a1b06f84d63dc0e2ee22bd13fe9d309 + method = merge + cmdver = 0.4.6 diff --git a/ingest/vendored/.shellcheckrc b/ingest/vendored/.shellcheckrc new file mode 100644 index 0000000..ebed438 --- /dev/null +++ b/ingest/vendored/.shellcheckrc @@ -0,0 +1,6 @@ +# Use of this file requires Shellcheck v0.7.0 or newer. +# +# SC2064 - We intentionally want variables to expand immediately within traps +# so the trap can not fail due to variable interpolation later. +# +disable=SC2064 diff --git a/ingest/vendored/README.md b/ingest/vendored/README.md new file mode 100644 index 0000000..0ad83f4 --- /dev/null +++ b/ingest/vendored/README.md @@ -0,0 +1,140 @@ +# ingest + +Shared internal tooling for pathogen data ingest. Used by our individual +pathogen repos which produce Nextstrain builds. Expected to be vendored by +each pathogen repo using `git subtree`. + +Some tools may only live here temporarily before finding a permanent home in +`augur curate` or Nextstrain CLI. Others may happily live out their days here. + +## Vendoring + +Nextstrain maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts. +(See discussion on this decision in https://github.com/nextstrain/ingest/issues/3) + +For a list of Nextstrain repos that are currently using this method, use [this +GitHub code search](https://github.com/search?type=code&q=org%3Anextstrain+subrepo+%22remote+%3D+https%3A%2F%2Fgithub.com%2Fnextstrain%2Fingest%22). + +If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation). 
+Then add the latest ingest scripts to the pathogen repo by running: + +``` +git subrepo clone https://github.com/nextstrain/ingest ingest/vendored +``` + +Any future updates of ingest scripts can be pulled in with: + +``` +git subrepo pull ingest/vendored +``` + +If you run into merge conflicts and would like to pull in a fresh copy of the +latest ingest scripts, pull with the `--force` flag: + +``` +git subrepo pull ingest/vendored --force +``` + +> **Warning** +> Beware of rebasing/dropping the parent commit of a `git subrepo` update + +`git subrepo` relies on metadata in the `ingest/vendored/.gitrepo` file, +which includes the hash for the parent commit in the pathogen repos. +If this hash no longer exists in the commit history, there will be errors when +running future `git subrepo pull` commands. + +If you run into an error similar to the following: +``` +$ git subrepo pull ingest/vendored +git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '. +fatal: not a valid object name: '' +``` +Check the parent commit hash in the `ingest/vendored/.gitrepo` file and make +sure the commit exists in the commit history. Update to the appropriate parent +commit hash if needed. + +## History + +Much of this tooling originated in +[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru +[mpox's ingest/](https://github.com/nextstrain/mpox/tree/@/ingest/). It +subsequently proliferated from [mpox][] to other pathogen repos ([rsv][], +[zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily thru +copying. To [counter that +proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079), +this repo was made. 
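The parent-commit troubleshooting described in the Vendoring section above can be scripted. Below is a minimal, self-contained sketch — it builds a throwaway repo purely so the snippet runs anywhere; in a real pathogen repo you would only run the last few lines against the existing `ingest/vendored/.gitrepo`. The `subrepo.parent` key follows git-subrepo's documented `.gitrepo` layout; `git` must be on `PATH`.

```shell
#!/usr/bin/env bash
# Demonstrate the parent-commit sanity check for git-subrepo metadata.
# The throwaway repo setup exists only to make this snippet self-contained.
set -euo pipefail

tmp="$(mktemp -d)"
cd "$tmp"
git init -q .
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial"

# git-subrepo records the parent commit under the [subrepo] section of .gitrepo.
mkdir -p ingest/vendored
git config --file ingest/vendored/.gitrepo subrepo.parent "$(git rev-parse HEAD)"

# The actual check: does the recorded parent commit still exist in history?
parent="$(git config --file ingest/vendored/.gitrepo subrepo.parent)"
if git cat-file -e "${parent}^{commit}" 2>/dev/null; then
    echo "OK: parent commit exists"
else
    echo "MISSING: parent commit ${parent}; update subrepo.parent in .gitrepo" >&2
fi
```

If the check reports a missing commit (for example, after a rebase dropped the parent), edit `parent = …` in `ingest/vendored/.gitrepo` to point at a commit that does exist before running `git subrepo pull` again.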
+ +[mpox]: https://github.com/nextstrain/mpox +[rsv]: https://github.com/nextstrain/rsv +[zika]: https://github.com/nextstrain/zika/pull/24 +[dengue]: https://github.com/nextstrain/dengue/pull/10 +[hepatitisB]: https://github.com/nextstrain/hepatitisB +[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov + +## Elsewhere + +The creation of this repo, in both the abstract and concrete, and the general +approach to "ingest" has been discussed in various internal places, including: + +- https://github.com/nextstrain/private/issues/59 +- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i) +- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079) +- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit) +- _…many others_ + +## Scripts + +Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools. + +- [notify-on-diff](notify-on-diff) - Send Slack message with diff of a local file and an S3 object +- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch +- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch +- [notify-on-record-change](notify-on-record-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`. + If the S3 object's metadata does not have `recordcount`, then it will attempt to download the S3 object to count lines locally, which only supports `xz` compressed S3 objects.
+- [notify-slack](notify-slack) - Send message or file to Slack +- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts +- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events. +- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message` + A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated. + +NCBI interaction scripts that are useful for fetching public metadata and sequences. + +- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file. + Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs. + +Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/ingest/issues/18. + +Potential Nextstrain CLI scripts + +- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts. +- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104). + This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script. +- [upload-to-s3](upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL. 
+ Skips upload if the local file's hash is identical to the S3 object's metadata `sha256sum`. + Adds the following user defined metadata to uploaded S3 object: + - `sha256sum` - hash of the file generated by [sha256sum](sha256sum) + - `recordcount` - the line count of the file +- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL. + Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`. + +Potential augur curate scripts + +- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records +- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records +- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.' +- [transform-field-names](transform-field-names) - Rename fields of NDJSON records +- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/) +- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields. + +## Software requirements + +Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash) which is more up to date. + +## Testing + +Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack. + +For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally, + +1. Download Cram: `pip install cram` +2.
Run the tests: `cram tests/` diff --git a/ingest/vendored/apply-geolocation-rules b/ingest/vendored/apply-geolocation-rules new file mode 100755 index 0000000..776cf16 --- /dev/null +++ b/ingest/vendored/apply-geolocation-rules @@ -0,0 +1,234 @@ +#!/usr/bin/env python3 +""" +Applies user curated geolocation rules to the geolocation fields in the NDJSON +records from stdin. The modified records are output to stdout. This does not do +any additional transformations on top of the user curations. +""" +import argparse +import json +from collections import defaultdict +from sys import exit, stderr, stdin, stdout + + +class CyclicGeolocationRulesError(Exception): + pass + + +def load_geolocation_rules(geolocation_rules_file): + """ + Loads the geolocation rules from the provided *geolocation_rules_file*. + Returns the rules as a dict: + { + regions: { + countries: { + divisions: { + locations: corrected_geolocations_tuple + } + } + } + } + """ + geolocation_rules = defaultdict(lambda: defaultdict(lambda: defaultdict(dict))) + with open(geolocation_rules_file, 'r') as rules_fh: + for line in rules_fh: + # ignore comments + if line.strip()=="" or line.lstrip()[0] == '#': + continue + + row = line.strip('\n').split('\t') + # Skip lines that cannot be split into raw and annotated geolocations + if len(row) != 2: + print( + f"WARNING: Could not decode geolocation rule {line!r}.", + "Please make sure rules are formatted as", + "'region/country/division/location<tab>region/country/division/location'.", + file=stderr) + continue + + # remove trailing comments + row[-1] = row[-1].partition('#')[0].rstrip() + raw , annot = tuple( row[0].split('/') ) , tuple( row[1].split('/') ) + + # Skip lines where raw or annotated geolocations cannot be split into 4 fields + if len(raw) != 4: + print( + f"WARNING: Could not decode the raw geolocation {row[0]!r}.", + "Please make sure it is formatted as 'region/country/division/location'.", + file=stderr + ) + continue + + if len(annot) != 4: +
print( + f"WARNING: Could not decode the annotated geolocation {row[1]!r}.", + "Please make sure it is formatted as 'region/country/division/location'.", + file=stderr + ) + continue + + + geolocation_rules[raw[0]][raw[1]][raw[2]][raw[3]] = annot + + return geolocation_rules + + +def get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal = None): + """ + Gets the annotated geolocation for the *raw_geolocation* in the provided + *geolocation_rules*. + + Recursively traverses the *geolocation_rules* until we get the annotated + geolocation, which must be a Tuple. Returns `None` if there are no + applicable rules for the provided *raw_geolocation*. + + Rules are applied in the order of region, country, division, location. + First checks the provided raw values for geolocation fields, then if there + are no matches, tries to use general rules marked with '*'. + """ + # Always instantiate the rule traversal as an empty list if not provided, + # e.g. the first call of this recursive function + if rule_traversal is None: + rule_traversal = [] + + current_rules = geolocation_rules + # Traverse the geolocation rules using the rule_traversal values + for field_value in rule_traversal: + current_rules = current_rules.get(field_value) + # If we hit `None`, then we know there are no matching rules, so stop the rule traversal + if current_rules is None: + break + + # We've found the tuple of the annotated geolocation + if isinstance(current_rules, tuple): + return current_rules + + # We've reached the next level of geolocation rules, + # so try to traverse the rules with the next target in raw_geolocation + if isinstance(current_rules, dict): + next_traversal_target = raw_geolocation[len(rule_traversal)] + rule_traversal.append(next_traversal_target) + return get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal) + + # We did not find any matching rule for the last traversal target + if current_rules is None: + # If we've used
all general rules and we still haven't found a match, + # then there are no applicable rules for this geolocation + if all(value == '*' for value in rule_traversal): + return None + + # If we failed to find matching rule with a general rule as the last + # traversal target, then delete all trailing '*'s to reset rule_traversal + # to end with the last index that is currently NOT a '*' + # [A, *, B, *] => [A, *, B] + # [A, B, *, *] => [A, B] + # [A, *, *, *] => [A] + if rule_traversal[-1] == '*': + # Find the index of the first of the consecutive '*' from the + # end of the rule_traversal + # [A, *, B, *] => first_consecutive_general_rule_index = 3 + # [A, B, *, *] => first_consecutive_general_rule_index = 2 + # [A, *, *, *] => first_consecutive_general_rule_index = 1 + for index, field_value in reversed(list(enumerate(rule_traversal))): + if field_value == '*': + first_consecutive_general_rule_index = index + else: + break + + rule_traversal = rule_traversal[:first_consecutive_general_rule_index] + + # Set the final value to '*' in hopes that by moving to a general rule, + # we can find a matching rule. + rule_traversal[-1] = '*' + + return get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal) + + +def transform_geolocations(geolocation_rules, geolocation): + """ + Transform the provided *geolocation* by looking it up in the provided + *geolocation_rules*. + + This will use all rules that apply to the geolocation and rules will + be applied in the order of region, country, division, location. + + Returns the original geolocation if no geolocation rules apply. + + Raises a `CyclicGeolocationRulesError` if more than 1000 rules have + been applied to the raw geolocation. 
+ """ + transformed_values = geolocation + rules_applied = 0 + continue_to_apply = True + + while continue_to_apply: + annotated_values = get_annotated_geolocation(geolocation_rules, transformed_values) + + # Stop applying rules if no annotated values were found + if annotated_values is None: + continue_to_apply = False + else: + rules_applied += 1 + + if rules_applied > 1000: + raise CyclicGeolocationRulesError( + "ERROR: More than 1000 geolocation rules applied on the same entry {geolocation!r}." + ) + + # Create a new list of values for comparison to previous values + new_values = list(transformed_values) + for index, value in enumerate(annotated_values): + # Keep original value if annotated value is '*' + if value != '*': + new_values[index] = value + + # Stop applying rules if this rule did not change the values, + # since this means we've reach rules with '*' that no longer change values + if new_values == transformed_values: + continue_to_apply = False + + transformed_values = new_values + + return transformed_values + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument("--region-field", default="region", + help="Field that contains regions in NDJSON records.") + parser.add_argument("--country-field", default="country", + help="Field that contains countries in NDJSON records.") + parser.add_argument("--division-field", default="division", + help="Field that contains divisions in NDJSON records.") + parser.add_argument("--location-field", default="location", + help="Field that contains location in NDJSON records.") + parser.add_argument("--geolocation-rules", metavar="TSV", required=True, + help="TSV file of geolocation rules with the format: " + + "'' where the raw and annotated geolocations " + + "are formatted as '///'. " + + "If creating a general rule, then the raw field value can be substituted with '*'." 
+ + "Lines starting with '#' will be ignored as comments." + + "Trailing '#' will be ignored as comments.") + + args = parser.parse_args() + + location_fields = [args.region_field, args.country_field, args.division_field, args.location_field] + + geolocation_rules = load_geolocation_rules(args.geolocation_rules) + + for record in stdin: + record = json.loads(record) + + try: + annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields]) + except CyclicGeolocationRulesError as e: + print(e, file=stderr) + exit(1) + + for index, field in enumerate(location_fields): + record[field] = annotated_values[index] + + json.dump(record, stdout, allow_nan=False, indent=None, separators=',:') + print() diff --git a/ingest/vendored/cloudfront-invalidate b/ingest/vendored/cloudfront-invalidate new file mode 100755 index 0000000..dbea398 --- /dev/null +++ b/ingest/vendored/cloudfront-invalidate @@ -0,0 +1,42 @@ +#!/usr/bin/env bash +# Originally from @tsibley's gist: https://gist.github.com/tsibley/a66262d341dedbea39b02f27e2837ea8 +set -euo pipefail + +main() { + local domain="$1" + shift + local paths=("$@") + local distribution invalidation + + echo "-> Finding CloudFront distribution" + distribution=$( + aws cloudfront list-distributions \ + --query "DistributionList.Items[?contains(Aliases.Items, \`$domain\`)] | [0].Id" \ + --output text + ) + + if [[ -z $distribution || $distribution == None ]]; then + exec >&2 + echo "Unable to find CloudFront distribution id for $domain" + echo + echo "Are your AWS CLI credentials for the right account?" + exit 1 + fi + + echo "-> Creating CloudFront invalidation for distribution $distribution" + invalidation=$( + aws cloudfront create-invalidation \ + --distribution-id "$distribution" \ + --paths "${paths[@]}" \ + --query Invalidation.Id \ + --output text + ) + + echo "-> Waiting for CloudFront invalidation $invalidation to complete" + echo " Ctrl-C to stop waiting." 
+ aws cloudfront wait invalidation-completed \ + --distribution-id "$distribution" \ + --id "$invalidation" +} + +main "$@" diff --git a/ingest/vendored/download-from-s3 b/ingest/vendored/download-from-s3 new file mode 100755 index 0000000..4981186 --- /dev/null +++ b/ingest/vendored/download-from-s3 @@ -0,0 +1,48 @@ +#!/usr/bin/env bash +set -euo pipefail + +bin="$(dirname "$0")" + +main() { + local src="${1:?A source s3:// URL is required as the first argument.}" + local dst="${2:?A destination file path is required as the second argument.}" + # How many lines to subsample to. 0 means no subsampling. Optional. + # It is not advised to use this for actual subsampling! This is intended to be + # used for debugging workflows with large datasets such as ncov-ingest as + # described in https://github.com/nextstrain/ncov-ingest/pull/367 + + # Uses `tsv-sample` to subsample, so it will not work as expected with files + # that have a single record split across multiple lines (i.e. FASTA sequences) + local n="${3:-0}" + + local s3path="${src#s3://}" + local bucket="${s3path%%/*}" + local key="${s3path#*/}" + + local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000 + dst_hash="$("$bin/sha256sum" < "$dst" || true)" + src_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")" + + echo "[ INFO] Downloading $src → $dst" + if [[ $src_hash != "$dst_hash" ]]; then + aws s3 cp --no-progress "$src" - | + if [[ "$src" == *.gz ]]; then + gunzip -cfq + elif [[ "$src" == *.xz ]]; then + xz -T0 -dcq + elif [[ "$src" == *.zst ]]; then + zstd -T0 -dcq + else + cat + fi | + if [[ "$n" -gt 0 ]]; then + tsv-sample -H -i -n "$n" + else + cat + fi >"$dst" + else + echo "[ INFO] Files are identical, skipping download" + fi +} + +main "$@" diff --git a/ingest/vendored/fetch-from-ncbi-entrez b/ingest/vendored/fetch-from-ncbi-entrez new file mode 100755 index 0000000..194a0c8 
--- /dev/null +++ b/ingest/vendored/fetch-from-ncbi-entrez @@ -0,0 +1,70 @@ +#!/usr/bin/env python3 +""" +Fetch metadata and nucleotide sequences from NCBI Entrez and output to a GenBank file. +""" +import json +import argparse +from Bio import SeqIO, Entrez + +# To use the efetch API, the docs indicate only around 10,000 records should be fetched per request +# https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch +# However, in my testing with HepB, the max records returned was 9,999 +# - Jover, 16 August 2023 +BATCH_SIZE = 9999 + +Entrez.email = "hello@nextstrain.org" + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument('--term', required=True, type=str, + help='Genbank search term. Replace spaces with "+", e.g. "Hepatitis+B+virus[All+Fields]complete+genome[All+Fields]"') + parser.add_argument('--output', required=True, type=str, help='Output file (Genbank)') + return parser.parse_args() + + +def get_esearch_history(term): + """ + Search for the provided *term* via ESearch and store the results using the + Entrez history server.¹ + + Returns the total count of returned records, query key, and web env needed + to access the records from the server. + + ¹ https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Using_the_Entrez_History_Server + """ + handle = Entrez.esearch(db="nucleotide", term=term, retmode="json", usehistory="y", retmax=0) + esearch_result = json.loads(handle.read())['esearchresult'] + print(f"Search term {term!r} returned {esearch_result['count']} IDs.") + return { + "count": int(esearch_result["count"]), + "query_key": esearch_result["querykey"], + "web_env": esearch_result["webenv"] + } + + +def fetch_from_esearch_history(count, query_key, web_env): + """ + Fetch records in batches from Entrez history server using the provided + *query_key* and *web_env* and yields them as a BioPython SeqRecord iterator. 
+ """ + print(f"Fetching GenBank records in batches of n={BATCH_SIZE}") + + for start in range(0, count, BATCH_SIZE): + handle = Entrez.efetch( + db="nucleotide", + query_key=query_key, + webenv=web_env, + retstart=start, + retmax=BATCH_SIZE, + rettype="gb", + retmode="text") + + yield SeqIO.parse(handle, "genbank") + + +if __name__=="__main__": + args = parse_args() + + with open(args.output, "w") as output_handle: + for batch_results in fetch_from_esearch_history(**get_esearch_history(args.term)): + SeqIO.write(batch_results, output_handle, "genbank") diff --git a/ingest/vendored/merge-user-metadata b/ingest/vendored/merge-user-metadata new file mode 100755 index 0000000..341c2df --- /dev/null +++ b/ingest/vendored/merge-user-metadata @@ -0,0 +1,55 @@ +#!/usr/bin/env python3 +""" +Merges user curated annotations with the NDJSON records from stdin, with the user +curations overwriting the existing fields. The modified records are output +to stdout. This does not do any additional transformations on top of the user +curations. +""" +import argparse +import csv +import json +from collections import defaultdict +from sys import exit, stdin, stderr, stdout + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument("--annotations", metavar="TSV", required=True, + help="Manually curated annotations TSV file. " + + "The TSV should not have a header and should have exactly three columns: " + + "id to match existing metadata, field name, and field value. " + + "If there are multiple annotations for the same id and field, then the last value is used. " + + "Lines starting with '#' are treated as comments. 
" + + "Any '#' after the field value are treated as comments.") + parser.add_argument("--id-field", default="accession", + help="The ID field in the metadata to use to merge with the annotations.") + + args = parser.parse_args() + + annotations = defaultdict(dict) + with open(args.annotations, 'r') as annotations_fh: + csv_reader = csv.reader(annotations_fh, delimiter='\t') + for row in csv_reader: + if not row or row[0].lstrip()[0] == '#': + continue + elif len(row) != 3: + print("WARNING: Could not decode annotation line " + "\t".join(row), file=stderr) + continue + id, field, value = row + annotations[id][field] = value.partition('#')[0].rstrip() + + for record in stdin: + record = json.loads(record) + + record_id = record.get(args.id_field) + if record_id is None: + print(f"ERROR: ID field {args.id_field!r} does not exist in record", file=stderr) + exit(1) + + record.update(annotations.get(record_id, {})) + + json.dump(record, stdout, allow_nan=False, indent=None, separators=',:') + print() diff --git a/ingest/vendored/notify-on-diff b/ingest/vendored/notify-on-diff new file mode 100755 index 0000000..ddbe7da --- /dev/null +++ b/ingest/vendored/notify-on-diff @@ -0,0 +1,35 @@ +#!/usr/bin/env bash + +set -euo pipefail + +: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}" +: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}" + +bin="$(dirname "$0")" + +src="${1:?A source file is required as the first argument.}" +dst="${2:?A destination s3:// URL is required as the second argument.}" + +dst_local="$(mktemp -t s3-file-XXXXXX)" +diff="$(mktemp -t diff-XXXXXX)" + +trap "rm -f '$dst_local' '$diff'" EXIT + +# if the file is not already present, just exit +"$bin"/s3-object-exists "$dst" || exit 0 + +"$bin"/download-from-s3 "$dst" "$dst_local" + +# diff's exit code is 0 for no differences, 1 for differences found, and >1 for errors +diff_exit_code=0 +diff "$dst_local" "$src" > "$diff" || diff_exit_code=$? 
+ +if [[ "$diff_exit_code" -eq 1 ]]; then + echo "Notifying Slack about diff." + "$bin"/notify-slack --upload "$src.diff" < "$diff" +elif [[ "$diff_exit_code" -gt 1 ]]; then + echo "Notifying Slack about diff failure" + "$bin"/notify-slack "Diff failed for $src" +else + echo "No change in $src." +fi diff --git a/ingest/vendored/notify-on-job-fail b/ingest/vendored/notify-on-job-fail new file mode 100755 index 0000000..7dd2409 --- /dev/null +++ b/ingest/vendored/notify-on-job-fail @@ -0,0 +1,23 @@ +#!/usr/bin/env bash +set -euo pipefail + +: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}" +: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}" + +: "${AWS_BATCH_JOB_ID:=}" +: "${GITHUB_RUN_ID:=}" + +bin="$(dirname "$0")" +job_name="${1:?A job name is required as the first argument}" +github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}" + +echo "Notifying Slack about failed ${job_name} job." +message="❌ ${job_name} job has FAILED 😞 " + +if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then + message+="See AWS Batch job \`${AWS_BATCH_JOB_ID}\` () for error details. " +elif [[ -n "${GITHUB_RUN_ID}" ]]; then + message+="See GitHub Action for error details. 
" +fi + +"$bin"/notify-slack "$message" diff --git a/ingest/vendored/notify-on-job-start b/ingest/vendored/notify-on-job-start new file mode 100755 index 0000000..1c8ce7d --- /dev/null +++ b/ingest/vendored/notify-on-job-start @@ -0,0 +1,27 @@ +#!/usr/bin/env bash +set -euo pipefail + +: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}" +: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}" + +: "${AWS_BATCH_JOB_ID:=}" +: "${GITHUB_RUN_ID:=}" + +bin="$(dirname "$0")" +job_name="${1:?A job name is required as the first argument}" +github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}" +build_dir="${3:-ingest}" + +echo "Notifying Slack about started ${job_name} job." +message="${job_name} job has started." + +if [[ -n "${GITHUB_RUN_ID}" ]]; then + message+=" The job was submitted by GitHub Action ." +fi + +if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then + message+=" The job was launched as AWS Batch job \`${AWS_BATCH_JOB_ID}\` ()." 
+ message+=" Follow along in your local clone of ${github_repo} with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} ${build_dir}"'```' +fi + +"$bin"/notify-slack "$message" diff --git a/ingest/vendored/notify-on-record-change b/ingest/vendored/notify-on-record-change new file mode 100755 index 0000000..f424252 --- /dev/null +++ b/ingest/vendored/notify-on-record-change @@ -0,0 +1,53 @@ +#!/usr/bin/env bash +set -euo pipefail + +: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}" +: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}" + +bin="$(dirname "$0")" + +src="${1:?A source ndjson file is required as the first argument.}" +dst="${2:?A destination ndjson s3:// URL is required as the second argument.}" +source_name=${3:?A record source name is required as the third argument.} + +# if the file is not already present, just exit +"$bin"/s3-object-exists "$dst" || exit 0 + +s3path="${dst#s3://}" +bucket="${s3path%%/*}" +key="${s3path#*/}" + +src_record_count="$(wc -l < "$src")" + +# Try getting record count from S3 object metadata +dst_record_count="$(aws s3api head-object --bucket "$bucket" --key "$key" --query "Metadata.recordcount || ''" --output text 2>/dev/null || true)" +if [[ -z "$dst_record_count" ]]; then + # This object doesn't have the record count stored as metadata + # We have to download it and count the lines locally + dst_record_count="$(wc -l < <(aws s3 cp --no-progress "$dst" - | xz -T0 -dcfq))" +fi + +added_records="$(( src_record_count - dst_record_count ))" + +printf "%'4d %s\n" "$src_record_count" "$src" +printf "%'4d %s\n" "$dst_record_count" "$dst" +printf "%'4d added records\n" "$added_records" + +slack_message="" + +if [[ $added_records -gt 0 ]]; then + echo "Notifying Slack about added records (n=$added_records)" + slack_message="📈 New records (n=$added_records) found on $source_name." 
+ +elif [[ $added_records -lt 0 ]]; then + echo "Notifying Slack about fewer records (n=$added_records)" + slack_message="📉 Fewer records (n=$added_records) found on $source_name." + +else + echo "Notifying Slack about same number of records" + slack_message="⛔ No new records found on $source_name." +fi + +slack_message+=" (Total record count: $src_record_count)" + +"$bin"/notify-slack "$slack_message" diff --git a/ingest/vendored/notify-slack b/ingest/vendored/notify-slack new file mode 100755 index 0000000..a343435 --- /dev/null +++ b/ingest/vendored/notify-slack @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +set -euo pipefail + +: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}" +: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}" + +upload=0 +output=/dev/null +thread_ts="" +broadcast=0 +args=() + +for arg; do + case "$arg" in + --upload) + upload=1;; + --output=*) + output="${arg#*=}";; + --thread-ts=*) + thread_ts="${arg#*=}";; + --broadcast) + broadcast=1;; + *) + args+=("$arg");; + esac +done + +set -- "${args[@]}" + +text="${1:?Some message text is required.}" + +if [[ "$upload" == 1 ]]; then + echo "Uploading data to Slack with the message: $text" + curl https://slack.com/api/files.upload \ + --header "Authorization: Bearer $SLACK_TOKEN" \ + --form-string channels="$SLACK_CHANNELS" \ + --form-string title="$text" \ + --form-string filename="$text" \ + --form-string thread_ts="$thread_ts" \ + --form file=@/dev/stdin \ + --form filetype=text \ + --fail --silent --show-error \ + --http1.1 \ + --output "$output" +else + echo "Posting Slack message: $text" + curl https://slack.com/api/chat.postMessage \ + --header "Authorization: Bearer $SLACK_TOKEN" \ + --form-string channel="$SLACK_CHANNELS" \ + --form-string text="$text" \ + --form-string thread_ts="$thread_ts" \ + --form-string reply_broadcast="$broadcast" \ + --fail --silent --show-error \ + --http1.1 \ + --output "$output" +fi diff --git 
a/ingest/vendored/s3-object-exists b/ingest/vendored/s3-object-exists new file mode 100755 index 0000000..679c20a --- /dev/null +++ b/ingest/vendored/s3-object-exists @@ -0,0 +1,8 @@ +#!/usr/bin/env bash +set -euo pipefail + +url="${1#s3://}" +bucket="${url%%/*}" +key="${url#*/}" + +aws s3api head-object --bucket "$bucket" --key "$key" &>/dev/null diff --git a/ingest/vendored/sha256sum b/ingest/vendored/sha256sum new file mode 100755 index 0000000..32d7ef8 --- /dev/null +++ b/ingest/vendored/sha256sum @@ -0,0 +1,15 @@ +#!/usr/bin/env python3 +""" +Portable sha256sum utility. +""" +from hashlib import sha256 +from sys import stdin + +chunk_size = 5 * 1024**2 # 5 MiB + +h = sha256() + +for chunk in iter(lambda: stdin.buffer.read(chunk_size), b""): + h.update(chunk) + +print(h.hexdigest()) diff --git a/ingest/vendored/tests/transform-strain-names/transform-strain-names.t b/ingest/vendored/tests/transform-strain-names/transform-strain-names.t new file mode 100644 index 0000000..1c05df7 --- /dev/null +++ b/ingest/vendored/tests/transform-strain-names/transform-strain-names.t @@ -0,0 +1,17 @@ +Look for strain name in "strain" or a list of backup fields. + +If strain entry exists, do not do anything. + + $ echo '{"strain": "i/am/a/strain", "strain_s": "other"}' \ + > | $TESTDIR/../../transform-strain-names \ + > --strain-regex '^.+$' \ + > --backup-fields strain_s accession + {"strain":"i/am/a/strain","strain_s":"other"} + +If strain entry does not exist, search the backup fields + + $ echo '{"strain_s": "other"}' \ + > | $TESTDIR/../../transform-strain-names \ + > --strain-regex '^.+$' \ + > --backup-fields accession strain_s + {"strain_s":"other","strain":"other"} \ No newline at end of file diff --git a/ingest/vendored/transform-authors b/ingest/vendored/transform-authors new file mode 100755 index 0000000..0bade20 --- /dev/null +++ b/ingest/vendored/transform-authors @@ -0,0 +1,66 @@ +#!/usr/bin/env python3 +""" +Abbreviates a full list of authors to be '<first author> et al.'
of the NDJSON +record from stdin and outputs modified records to stdout. + +Note: This is a "best effort" approach and can potentially mangle the author name. +""" +import argparse +import json +import re +from sys import stderr, stdin, stdout + + +def parse_authors(record: dict, authors_field: str, default_value: str, + index: int, abbr_authors_field: str = None) -> dict: + # Strip and normalize whitespace + new_authors = re.sub(r'\s+', ' ', record[authors_field]) + + if new_authors == "": + new_authors = default_value + else: + # Split authors list on comma/semicolon + # OR "and"/"&" with at least one space before and after + new_authors = re.split(r'(?:\s*[,,;;]\s*|\s+(?:and|&)\s+)', new_authors)[0] + + # if it does not already end with " et al.", add it + if not new_authors.strip('. ').endswith(" et al"): + new_authors += ' et al' + + if abbr_authors_field: + if record.get(abbr_authors_field): + print( + f"WARNING: the {abbr_authors_field!r} field already exists", + f"in record {index} and will be overwritten!", + file=stderr + ) + + record[abbr_authors_field] = new_authors + else: + record[authors_field] = new_authors + + return record + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument("--authors-field", default="authors", + help="The field containing list of authors.") + parser.add_argument("--default-value", default="?", + help="Default value to use if authors list is empty.") + parser.add_argument("--abbr-authors-field", + help="The field for the generated abbreviated authors. 
" + + "If not provided, the original authors field will be modified.") + + args = parser.parse_args() + + for index, record in enumerate(stdin): + record = json.loads(record) + + parse_authors(record, args.authors_field, args.default_value, index, args.abbr_authors_field) + + json.dump(record, stdout, allow_nan=False, indent=None, separators=',:') + print() diff --git a/ingest/vendored/transform-field-names b/ingest/vendored/transform-field-names new file mode 100755 index 0000000..fde223f --- /dev/null +++ b/ingest/vendored/transform-field-names @@ -0,0 +1,48 @@ +#!/usr/bin/env python3 +""" +Renames fields of the NDJSON record from stdin and outputs modified records +to stdout. +""" +import argparse +import json +from sys import stderr, stdin, stdout + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument("--field-map", nargs="+", + help="Fields names in the NDJSON record mapped to new field names, " + + "formatted as '{old_field_name}={new_field_name}'. " + + "If the old field does not exist in record, the new field will be added with an empty string value." + + "If the new field already exists in record, then the renaming of the old field will be skipped.") + parser.add_argument("--force", action="store_true", + help="Force renaming of old field even if the new field already exists. 
" + + "Please keep in mind this will overwrite the value of the new field.") + + args = parser.parse_args() + + field_map = {} + for field in args.field_map: + old_name, new_name = field.split('=') + field_map[old_name] = new_name + + for record in stdin: + record = json.loads(record) + + for old_field, new_field in field_map.items(): + + if record.get(new_field) and not args.force: + print( + f"WARNING: skipping rename of {old_field} because record", + f"already has a field named {new_field}.", + file=stderr + ) + continue + + record[new_field] = record.pop(old_field, '') + + json.dump(record, stdout, allow_nan=False, indent=None, separators=',:') + print() diff --git a/ingest/vendored/transform-genbank-location b/ingest/vendored/transform-genbank-location new file mode 100755 index 0000000..70ba56f --- /dev/null +++ b/ingest/vendored/transform-genbank-location @@ -0,0 +1,43 @@ +#!/usr/bin/env python3 +""" +Parses GenBank's 'location' field of the NDJSON record from stdin to 3 separate +fields: 'country', 'division', and 'location'. Checks that a record is from +GenBank by verifying that the 'database' field has a value of "GenBank" or "RefSeq". + +Outputs the modified record to stdout. 
+""" +import json +from sys import stdin, stdout + + +def parse_location(record: dict) -> dict: + # Expected pattern for the location field is "[:][, ]" + # See GenBank docs for their "country" field: + # https://www.ncbi.nlm.nih.gov/genbank/collab/country/ + geographic_data = record['location'].split(':') + + country = geographic_data[0] + division = '' + location = '' + + if len(geographic_data) == 2: + division , _ , location = geographic_data[1].partition(',') + + record['country'] = country.strip() + record['division'] = division.strip() + record['location'] = location.strip() + + return record + + +if __name__ == '__main__': + + for record in stdin: + record = json.loads(record) + + database = record.get('database', '') + if database in {'GenBank', 'RefSeq'}: + parse_location(record) + + json.dump(record, stdout, allow_nan=False, indent=None, separators=',:') + print() diff --git a/ingest/vendored/transform-strain-names b/ingest/vendored/transform-strain-names new file mode 100755 index 0000000..d86c0e4 --- /dev/null +++ b/ingest/vendored/transform-strain-names @@ -0,0 +1,50 @@ +#!/usr/bin/env python3 +""" +Verifies strain name pattern in the 'strain' field of the NDJSON record from +stdin. Adds a 'strain' field to the record if it does not already exist. + +Outputs the modified records to stdout. +""" +import argparse +import json +import re +from sys import stderr, stdin, stdout + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + parser.add_argument("--strain-regex", default="^.+$", + help="Regex pattern for strain names. " + + "Strain names that do not match the pattern will be dropped.") + parser.add_argument("--backup-fields", nargs="*", + help="List of backup fields to use as strain name if the value in 'strain' " + + "does not match the strain regex pattern. 
" + + "If multiple fields are provided, will use the first field that has a non-empty string.") + + args = parser.parse_args() + + strain_name_pattern = re.compile(args.strain_regex) + + for index, record in enumerate(stdin): + record = json.loads(record) + + # Verify strain name matches the strain regex pattern + if strain_name_pattern.match(record.get('strain', '')) is None: + # Default to empty string if not matching pattern + record['strain'] = '' + # Use non-empty value of backup fields if provided + if args.backup_fields: + for field in args.backup_fields: + if record.get(field): + record['strain'] = str(record[field]) + break + + if record['strain'] == '': + print(f"WARNING: Record number {index} has an empty string as the strain name.", file=stderr) + + + json.dump(record, stdout, allow_nan=False, indent=None, separators=',:') + print() diff --git a/ingest/vendored/trigger b/ingest/vendored/trigger new file mode 100755 index 0000000..586f9cc --- /dev/null +++ b/ingest/vendored/trigger @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +set -euo pipefail + +: "${PAT_GITHUB_DISPATCH:=}" + +github_repo="${1:?A GitHub repository with owner and repository name is required as the first argument.}" +event_type="${2:?An event type is required as the second argument.}" +shift 2 + +if [[ $# -eq 0 && -z $PAT_GITHUB_DISPATCH ]]; then + cat >&2 <<. +You must specify options to curl for your GitHub credentials. For example, you +can specify your GitHub username, and will be prompted for your password: + + $0 $github_repo $event_type --user + +Be sure to enter a personal access token¹ as your password since GitHub has +discontinued password authentication to the API starting on November 13, 2020². + +You can also store your credentials or a personal access token in a netrc +file³: + + machine api.github.com + login + password + +and then tell curl to use it: + + $0 $github_repo $event_type --netrc + +which will then not require you to type your password every time. 
+ +¹ https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line +² https://docs.github.com/en/rest/overview/other-authentication-methods#via-username-and-password +³ https://ec.haxx.se/usingcurl/usingcurl-netrc +. + exit 1 +fi + +auth=':' +if [[ -n $PAT_GITHUB_DISPATCH ]]; then + auth="Authorization: Bearer ${PAT_GITHUB_DISPATCH}" +fi + +if curl -fsS "https://api.github.com/repos/${github_repo}/dispatches" \ + -H 'Accept: application/vnd.github.v3+json' \ + -H 'Content-Type: application/json' \ + -H "$auth" \ + -d '{"event_type":"'"$event_type"'"}' \ + "$@" +then + echo "Successfully triggered $event_type" +else + echo "Request failed" >&2 + exit 1 +fi diff --git a/ingest/vendored/trigger-on-new-data b/ingest/vendored/trigger-on-new-data new file mode 100755 index 0000000..470d2f4 --- /dev/null +++ b/ingest/vendored/trigger-on-new-data @@ -0,0 +1,32 @@ +#!/usr/bin/env bash +set -euo pipefail + +: "${PAT_GITHUB_DISPATCH:?The PAT_GITHUB_DISPATCH environment variable is required.}" + +bin="$(dirname "$0")" + +github_repo="${1:?A GitHub repository with owner and repository name is required as the first argument.}" +event_type="${2:?An event type is required as the second argument.}" +metadata="${3:?A metadata upload output file is required as the third argument.}" +sequences="${4:?A sequence FASTA upload output file is required as the fourth argument.}" +identical_file_message="${5:-files are identical}" + +new_metadata=$(grep "$identical_file_message" "$metadata" >/dev/null; echo $?) +new_sequences=$(grep "$identical_file_message" "$sequences" >/dev/null; echo $?)
+ +slack_message="" + +# grep exit status 0 for found match, 1 for no match, 2 if an error occurred +if [[ $new_metadata -eq 1 || $new_sequences -eq 1 ]]; then + slack_message="Triggering new builds due to updated metadata and/or sequences" + "$bin"/trigger "$github_repo" "$event_type" +elif [[ $new_metadata -eq 0 && $new_sequences -eq 0 ]]; then + slack_message="Skipping trigger of rebuild: Both metadata TSV and sequences FASTA are identical to S3 files." +else + slack_message="Skipping trigger of rebuild: Unable to determine if data has been updated." +fi + + +if ! "$bin"/notify-slack "$slack_message"; then + echo "Notifying Slack failed, but exiting with success anyway." +fi diff --git a/ingest/vendored/upload-to-s3 b/ingest/vendored/upload-to-s3 new file mode 100755 index 0000000..36d171c --- /dev/null +++ b/ingest/vendored/upload-to-s3 @@ -0,0 +1,78 @@ +#!/usr/bin/env bash +set -euo pipefail + +bin="$(dirname "$0")" + +main() { + local quiet=0 + + for arg; do + case "$arg" in + --quiet) + quiet=1 + shift;; + *) + break;; + esac + done + + local src="${1:?A source file is required as the first argument.}" + local dst="${2:?A destination s3:// URL is required as the second argument.}" + local cloudfront_domain="${3:-}" + + local s3path="${dst#s3://}" + local bucket="${s3path%%/*}" + local key="${s3path#*/}" + + local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000 + src_hash="$("$bin/sha256sum" < "$src")" + dst_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")" + + if [[ $src_hash != "$dst_hash" ]]; then + # The record count may have changed + src_record_count="$(wc -l < "$src")" + + echo "Uploading $src → $dst" + if [[ "$dst" == *.gz ]]; then + gzip -c "$src" + elif [[ "$dst" == *.xz ]]; then + xz -2 -T0 -c "$src" + elif [[ "$dst" == *.zst ]]; then + zstd -T0 -c "$src" + else + cat "$src" + fi | aws s3 cp --no-progress - "$dst" 
--metadata sha256sum="$src_hash",recordcount="$src_record_count" "$(content-type "$dst")" + + if [[ -n $cloudfront_domain ]]; then + echo "Creating CloudFront invalidation for $cloudfront_domain/$key" + if ! "$bin"/cloudfront-invalidate "$cloudfront_domain" "/$key"; then + echo "CloudFront invalidation failed, but exiting with success anyway." + fi + fi + + if [[ $quiet == 1 ]]; then + echo "Quiet mode. No Slack notification sent." + exit 0 + fi + + if ! "$bin"/notify-slack "Updated $dst available."; then + echo "Notifying Slack failed, but exiting with success anyway." + fi + else + echo "Uploading $src → $dst: files are identical, skipping upload" + fi +} + +content-type() { + case "$1" in + *.tsv) echo --content-type=text/tab-separated-values;; + *.csv) echo --content-type=text/comma-separated-values;; + *.ndjson) echo --content-type=application/x-ndjson;; + *.gz) echo --content-type=application/gzip;; + *.xz) echo --content-type=application/x-xz;; + *.zst) echo --content-type=application/zstd;; + *) echo --content-type=text/plain;; + esac +} + +main "$@" From 38007bb2c5fe6ed3ab00e6da05d8a138f36e18f3 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 13:33:07 -0800 Subject: [PATCH 03/28] Replace mpox text and taxon id with zika --- ingest/README.md | 6 +++--- ingest/config/config.yaml | 4 ++-- ingest/config/optional.yaml | 2 +- ingest/workflow/snakemake_rules/slack_notifications.smk | 6 +++--- ingest/workflow/snakemake_rules/trigger_rebuild.smk | 6 +++--- 5 files changed, 12 insertions(+), 12 deletions(-) diff --git a/ingest/README.md b/ingest/README.md index b7eb815..3d8c451 100644 --- a/ingest/README.md +++ b/ingest/README.md @@ -1,6 +1,6 @@ -# nextstrain.org/mpox/ingest +# nextstrain.org/zika/ingest -This is the ingest pipeline for mpox virus sequences. +This is the ingest pipeline for zika virus sequences. 
## Software requirements @@ -9,7 +9,7 @@ Follow the [standard installation instructions](https://docs.nextstrain.org/en/l ## Usage > NOTE: All command examples assume you are within the `ingest` directory. -> If running commands from the outer `mpox` directory, please replace the `.` with `ingest` +> If running commands from the outer `zika` directory, please replace the `.` with `ingest` Fetch sequences with diff --git a/ingest/config/config.yaml b/ingest/config/config.yaml index 8d18c5f..8ea03e6 100644 --- a/ingest/config/config.yaml +++ b/ingest/config/config.yaml @@ -1,7 +1,7 @@ # Sources of sequences to include in the ingest run sources: ['genbank'] # Pathogen NCBI Taxonomy ID -ncbi_taxon_id: '10244' +ncbi_taxon_id: '64320' # Renames the NCBI dataset headers ncbi_field_map: 'source-data/ncbi-dataset-field-map.tsv' @@ -41,7 +41,7 @@ transform: abbr_authors_field: 'abbr_authors' # General geolocation rules to apply to geolocation fields geolocation_rules_url: 'https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv' - # Local geolocation rules that are only applicable to mpox data + # Local geolocation rules that are only applicable to zika data # Local rules can overwrite the general geolocation rules provided above local_geolocation_rules: 'source-data/geolocation-rules.tsv' # User annotations file diff --git a/ingest/config/optional.yaml b/ingest/config/optional.yaml index d445e07..9f00a7e 100644 --- a/ingest/config/optional.yaml +++ b/ingest/config/optional.yaml @@ -4,7 +4,7 @@ upload: # Upload params for AWS S3 s3: # AWS S3 Bucket with prefix - dst: 's3://nextstrain-data/files/workflows/mpox' + dst: 's3://nextstrain-data/files/workflows/zika' # Mapping of files to upload, with key as remote file name and the value # the local file path relative to the ingest directory. 
files_to_upload: diff --git a/ingest/workflow/snakemake_rules/slack_notifications.smk b/ingest/workflow/snakemake_rules/slack_notifications.smk index 9eb0463..2b7ec61 100644 --- a/ingest/workflow/snakemake_rules/slack_notifications.smk +++ b/ingest/workflow/snakemake_rules/slack_notifications.smk @@ -18,7 +18,7 @@ if not slack_envvars_defined: ) sys.exit(1) -S3_SRC = "s3://nextstrain-data/files/workflows/mpox" +S3_SRC = "s3://nextstrain-data/files/workflows/zika" rule notify_on_genbank_record_change: @@ -48,8 +48,8 @@ rule notify_on_metadata_diff: onstart: - shell("./vendored/notify-on-job-start Ingest nextstrain/mpox") + shell("./vendored/notify-on-job-start Ingest nextstrain/zika") onerror: - shell("./vendored/notify-on-job-fail Ingest nextstrain/mpox") + shell("./vendored/notify-on-job-fail Ingest nextstrain/zika") diff --git a/ingest/workflow/snakemake_rules/trigger_rebuild.smk b/ingest/workflow/snakemake_rules/trigger_rebuild.smk index 2e797ee..0cf6731 100644 --- a/ingest/workflow/snakemake_rules/trigger_rebuild.smk +++ b/ingest/workflow/snakemake_rules/trigger_rebuild.smk @@ -1,5 +1,5 @@ """ -This part of the workflow handles triggering new mpox builds after the +This part of the workflow handles triggering new zika builds after the latest metadata TSV and sequence FASTA files have been uploaded to S3. Designed to be used internally by the Nextstrain team with hard-coded paths @@ -9,7 +9,7 @@ to expected upload flag files. rule trigger_build: """ - Triggering monekypox builds via repository action type `rebuild`. + Triggering zika builds via repository action type `rebuild`. 
""" input: metadata_upload="data/upload/s3/metadata.tsv.gz.done", @@ -18,5 +18,5 @@ rule trigger_build: touch("data/trigger/rebuild.done"), shell: """ - ./vendored/trigger-on-new-data nextstrain/mpox rebuild {input.metadata_upload} {input.fasta_upload} + ./vendored/trigger-on-new-data nextstrain/zika rebuild {input.metadata_upload} {input.fasta_upload} """ From c475361cb87cbf413e6b29f617d71058104ff968 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 13:35:37 -0800 Subject: [PATCH 04/28] Remove Nextclade related rules Removal of Nextclade related rules, pending the compilation of a Nextclade zika dataset and potential v3 changes. --- ingest/Snakefile | 1 - ingest/config/config.yaml | 6 -- ingest/source-data/nextclade-field-map.tsv | 16 ---- ingest/workflow/snakemake_rules/nextclade.smk | 86 ------------------- ingest/workflow/snakemake_rules/transform.smk | 4 +- 5 files changed, 2 insertions(+), 111 deletions(-) delete mode 100644 ingest/source-data/nextclade-field-map.tsv delete mode 100644 ingest/workflow/snakemake_rules/nextclade.smk diff --git a/ingest/Snakefile b/ingest/Snakefile index 0ed057b..bfc99b0 100644 --- a/ingest/Snakefile +++ b/ingest/Snakefile @@ -57,7 +57,6 @@ rule all: include: "workflow/snakemake_rules/fetch_sequences.smk" include: "workflow/snakemake_rules/transform.smk" -include: "workflow/snakemake_rules/nextclade.smk" if config.get("upload", False): diff --git a/ingest/config/config.yaml b/ingest/config/config.yaml index 8ea03e6..64dbab1 100644 --- a/ingest/config/config.yaml +++ b/ingest/config/config.yaml @@ -71,9 +71,3 @@ transform: 'institution' ] -# Params for Nextclade related rules -nextclade: - # Field to use as the sequence ID in the Nextclade file - id_field: 'seqName' - # Fields from a Nextclade file to be renamed (if desired) and appended to a metadata file - field_map: 'source-data/nextclade-field-map.tsv' diff --git a/ingest/source-data/nextclade-field-map.tsv b/ingest/source-data/nextclade-field-map.tsv 
deleted file mode 100644 index a495da3..0000000 --- a/ingest/source-data/nextclade-field-map.tsv +++ /dev/null @@ -1,16 +0,0 @@ -key value -seqName seqName -clade clade -outbreak outbreak -lineage lineage -coverage coverage -totalMissing missing_data -totalSubstitutions divergence -totalNonACGTNs nonACGTN -qc.missingData.status QC_missing_data -qc.mixedSites.status QC_mixed_sites -qc.privateMutations.status QC_rare_mutations -qc.frameShifts.status QC_frame_shifts -qc.stopCodons.status QC_stop_codons -frameShifts frame_shifts -isReverseComplement is_reverse_complement \ No newline at end of file diff --git a/ingest/workflow/snakemake_rules/nextclade.smk b/ingest/workflow/snakemake_rules/nextclade.smk deleted file mode 100644 index f10a3f9..0000000 --- a/ingest/workflow/snakemake_rules/nextclade.smk +++ /dev/null @@ -1,86 +0,0 @@ - -rule nextclade_dataset: - output: - temp("mpxv.zip"), - shell: - """ - nextclade dataset get --name MPXV --output-zip {output} - """ - - -rule nextclade_dataset_hMPXV: - output: - temp("hmpxv.zip"), - shell: - """ - nextclade dataset get --name hMPXV --output-zip {output} - """ - - -rule align: - input: - sequences="results/sequences.fasta", - dataset="hmpxv.zip", - output: - alignment="data/alignment.fasta", - insertions="data/insertions.csv", - translations="data/translations.zip", - params: - # The lambda is used to deactivate automatic wildcard expansion. 
- # https://github.com/snakemake/snakemake/blob/384d0066c512b0429719085f2cf886fdb97fd80a/snakemake/rules.py#L997-L1000 - translations=lambda w: "data/translations/{gene}.fasta", - threads: 4 - shell: - """ - nextclade run -D {input.dataset} -j {threads} --retry-reverse-complement \ - --output-fasta {output.alignment} --output-translations {params.translations} \ - --output-insertions {output.insertions} {input.sequences} - zip -rj {output.translations} data/translations - """ - - -rule nextclade: - input: - sequences="results/sequences.fasta", - dataset="mpxv.zip", - output: - "data/nextclade.tsv", - threads: 4 - shell: - """ - nextclade run -D {input.dataset} -j {threads} --output-tsv {output} {input.sequences} --retry-reverse-complement - """ - - -rule join_metadata_clades: - input: - nextclade="data/nextclade.tsv", - metadata="data/metadata_raw.tsv", - nextclade_field_map=config["nextclade"]["field_map"], - output: - metadata="results/metadata.tsv", - params: - id_field=config["transform"]["id_field"], - nextclade_id_field=config["nextclade"]["id_field"], - shell: - """ - export SUBSET_FIELDS=`awk 'NR>1 {{print $1}}' {input.nextclade_field_map} | tr '\n' ',' | sed 's/,$//g'` - - csvtk -tl cut -f $SUBSET_FIELDS \ - {input.nextclade} \ - | csvtk -tl rename2 \ - -F \ - -f '*' \ - -p '(.+)' \ - -r '{{kv}}' \ - -k {input.nextclade_field_map} \ - | tsv-join -H \ - --filter-file - \ - --key-fields {params.nextclade_id_field} \ - --data-fields {params.id_field} \ - --append-fields '*' \ - --write-all ? 
\ - {input.metadata} \ - | tsv-select -H --exclude {params.nextclade_id_field} \ - > {output.metadata} - """ diff --git a/ingest/workflow/snakemake_rules/transform.smk b/ingest/workflow/snakemake_rules/transform.smk index fe7d7c1..ec63d00 100644 --- a/ingest/workflow/snakemake_rules/transform.smk +++ b/ingest/workflow/snakemake_rules/transform.smk @@ -6,7 +6,7 @@ formats and expects input file This will produce output files as - metadata = "data/metadata_raw.tsv" + metadata = "results/metadata.tsv" sequences = "results/sequences.fasta" Parameters are expected to be defined in `config.transform`. @@ -42,7 +42,7 @@ rule transform: all_geolocation_rules="data/all-geolocation-rules.tsv", annotations=config["transform"]["annotations"], output: - metadata="data/metadata_raw.tsv", + metadata="results/metadata.tsv", sequences="results/sequences.fasta", log: "logs/transform.txt", From d154a880a39484122d74c7b8206315596468cca3 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 13:37:17 -0800 Subject: [PATCH 05/28] Clear pathogen specific user provided annotations and rules --- ingest/source-data/annotations.tsv | 277 +---------------------- ingest/source-data/geolocation-rules.tsv | 17 +- 2 files changed, 2 insertions(+), 292 deletions(-) diff --git a/ingest/source-data/annotations.tsv b/ingest/source-data/annotations.tsv index e3c3ec1..8b13789 100644 --- a/ingest/source-data/annotations.tsv +++ b/ingest/source-data/annotations.tsv @@ -1,276 +1 @@ -AF380138 country Democratic Republic of the Congo -AY741551 country Sierra Leone -DQ011153 country USA -DQ011154 country Republic of the Congo -DQ011155 country Democratic Republic of the Congo -DQ011156 country Liberia -DQ011157 country USA -NC_003310 country Democratic Republic of the Congo -OR473631 country France -AF380138 region Africa -AY741551 region Africa -AY753185 region Africa -DQ011153 region North America -DQ011154 region Africa -DQ011155 region Africa -DQ011156 region Africa -DQ011157 region North 
America -NC_003310 region Africa -OR473631 region Europe -AY603973 date 1961-XX-XX -AY741551 date 1970-XX-XX -AY753185 date 1958-XX-XX -DQ011153 date 2003-XX-XX -DQ011156 date 1970-XX-XX -DQ011157 date 2003-XX-XX -LC722946 date 2022-07-XX -MG693723 date 2017-XX-XX -AF380138 date 1996-XX-XX -DQ011154 date 2003-XX-XX -DQ011155 date 1978-XX-XX -HM172544 date 1979-XX-XX -HQ857562 date 1979-XX-XX -MT903337 date 2018-XX-XX -MT903338 date 2018-XX-XX -MT903339 date 2018-XX-XX -MT903340 date 2018-XX-XX -MT903341 date 2018-08-14 -MT903342 date 2019-04-30 -MT903342 date 2019-05-XX -MT903343 date 2018-09-XX -MT903344 date 2018-09-XX -MT903345 date 2018-09-XX -MT903346 date 2003-XX-XX -MT903347 date 2003-XX-XX -MT903348 date 2003-XX-XX -ON782021 date 2022-05-24 -ON782022 date 2022-05-31 -ON880529 date 2022-05-28 -ON880533 date 2022-05-30 -ON880534 date 2022-05-30 -OP536786 date 2022-XX-XX -OX044336 date 2022-XX-XX -OX044337 date 2022-XX-XX -OX044338 date 2022-XX-XX -OX044339 date 2022-XX-XX -OX044340 date 2022-XX-XX -OX044341 date 2022-XX-XX -OX044342 date 2022-XX-XX -OX044343 date 2022-XX-XX -OX044344 date 2022-XX-XX -OX044345 date 2022-XX-XX -OX044346 date 2022-XX-XX -OX044347 date 2022-XX-XX -OX044348 date 2022-XX-XX -OX344864 date 2022-XX-XX -OX344865 date 2022-XX-XX -OX344866 date 2022-XX-XX -OX344867 date 2022-XX-XX -OX344868 date 2022-XX-XX -OX344869 date 2022-XX-XX -OX344870 date 2022-XX-XX -OX344871 date 2022-XX-XX -OX344872 date 2022-XX-XX -OX344873 date 2022-XX-XX -OX344874 date 2022-XX-XX -OX344875 date 2022-XX-XX -OX344876 date 2022-XX-XX -OX344877 date 2022-XX-XX -OX344878 date 2022-XX-XX -OX344879 date 2022-XX-XX -OX344880 date 2022-XX-XX -OX344881 date 2022-XX-XX -OX344882 date 2022-XX-XX -OX344883 date 2022-XX-XX -OX344884 date 2022-XX-XX -OX344885 date 2022-XX-XX -OX344886 date 2022-XX-XX -OX344887 date 2022-XX-XX -OX344888 date 2022-XX-XX -OX344889 date 2022-XX-XX -OX344890 date 2022-XX-XX -AF380138 strain Zaire-96-I-16 -DQ011154 strain Congo_2003_358 
-DQ011155 strain Zaire_1979-005 -HM172544 strain Zaire 1979-005 -HQ857562 strain V79-I-005 -MT903338 strain MPXV-M2957_Lagos -AY603973 institution Biochemistry & Microbiology, University of Victoria, Canada -AY741551 institution Biochemistry & Microbiology, University of Victoria, Canada -AY753185 institution Biochemistry & Microbiology, University of Victoria, Canada -DQ011153 institution National Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA -DQ011154 institution National Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA -DQ011156 institution National Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA -DQ011157 institution National Center for Infectious Diseases, Centers for Disease Control and Prevention (US CDC), USA -FV537349 institution Genetics Signatures, Sydney, Australia -FV537350 institution Genetics Signatures, Sydney, Australia -FV537351 institution Genetics Signatures, Sydney, Australia -FV537352 institution Genetics Signatures, Sydney, Australia -HM172544 institution Virology, The United States Army Medical Research Institute for Infectious Diseases (USAMRIID), USA -HQ857562 institution Vaccine and Gene Therapy Institute, Oregon Health and Science University, USA -HQ857563 institution Vaccine and Gene Therapy Institute, Oregon Health and Science University, USA -JX878407 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878408 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878409 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878410 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878411 institution Center for Genome 
Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878412 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878413 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878414 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878415 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878416 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878417 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878418 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878419 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878420 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878421 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878422 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878423 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878424 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878425 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878426 
institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878427 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878428 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -JX878429 institution Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), USA -KC257459 institution Biochemistry & Microbiology, University of Victoria, Canada -KC257460 institution Biochemistry & Microbiology, University of Victoria, Canada -KJ136820 institution Centre for Biological Security, Robert Koch Institute, Germany -KJ642612 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KJ642613 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KJ642614 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KJ642615 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KJ642616 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KJ642617 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KJ642618 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KJ642619 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KP849469 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KP849470 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -KP849471 institution Poxvirus Program, Centres for Disease Control and prevention (US CDC), USA -MK783028 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA -MK783029 institution NCEZID/DHCPP/PRB, Centers for 
Disease Control & Prevention (US CDC), USA -MK783030 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA -MK783031 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA -MK783032 institution NCEZID/DHCPP/PRB, Centers for Disease Control & Prevention (US CDC), USA -MN346690 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346692 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346693 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346694 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346695 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346696 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346698 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346699 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346700 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN346702 institution Project Group Epidemiology of Highly Pathogenic Microorganisms, Robert Koch Institute, Germany -MN648051 institution Biochemistry & Molecular Genetics, Israel Institute for Biological Research, Israel -MN702444 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702445 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702446 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702447 institution Virology, Centre International de Recherches Medicales 
de Franceville, CIRMF, Gabon -MN702448 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702449 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702450 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702451 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702452 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MN702453 institution Virology, Centre International de Recherches Medicales de Franceville, CIRMF, Gabon -MT724769 institution Biology, Universiteit Antwerpen, Belgium -MT724770 institution Biology, Universiteit Antwerpen, Belgium -MT724771 institution Biology, Universiteit Antwerpen, Belgium -MT724772 institution Biology, Universiteit Antwerpen, Belgium -MT903337 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903338 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903339 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903340 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903341 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903342 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence 
Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903343 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903344 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903345 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903346 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903347 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -MT903348 institution National Center for Emerging and Zoonotic Infectious Diseases, Division of High Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -NC_003310 institution Department of Molecular Biology of Genomes, SRC VB Vector, Russia -ON563414 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON568298 institution Microbiol Genomics & Bioinformatics, Bundeswehr Institute of Microbiology, Germany -ON585029 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585030 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585031 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585032 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585033 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal 
-ON585034 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585035 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585036 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585037 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON585038 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON595760 institution Laboratory of Viorology, University Hospital of Geneva, Switzerland -ON602722 institution IHAP, VIRAL, Universite de Toulouse, INRAE, ENVT, France -ON609725 institution Laboratory for Diagnostics of Zoonoses & WHO Centre, Institute of Microbiology & Immunology, Faculty of Medicine, University of Ljubljana, Slovenia -ON614676 institution Laboratory of Virology, INMI Lazzaro Spallanzani IRCCS, Italy -ON615424 institution Public Health Virology, Erasmus Medical Centre, The Netherlands -ON619835 institution Research & Evaluation, UKHSA, UK -ON619836 institution Research & Evaluation, UKHSA, UK -ON619837 institution Research & Evaluation, UKHSA, UK -ON619838 institution Research & Evaluation, UKHSA, UK -ON622712 institution Microbiology, Immunology & Transplantation, KU Leuven, Rega Institute, Belgium -ON622713 institution Microbiology, Immunology & Transplantation, KU Leuven, Rega Institute, Belgium -ON622718 institution Microbiology, Hospital Universitari Germans Trias i Pujol, Spain -ON622720 institution Laboratory of Viorology, University Hospital of Geneva, Switzerland -ON622721 institution Department of Biomedical & Clinical Sciences, University of Milan, Italy -ON622722 institution Virology, GENomique EPIdemiologique das maladies Infectieuses, France -ON627808 institution Department of Health, Utah Public Health Laboratory, USA -ON631241 institution Laboratory for Diagnostics of Zoonoses & WHO Centre, Institute of Microbiology & Immunology, Faculty of Medicine, University of Ljubljana, Slovenia -ON631963 
institution Victorian Infectious Diseases Reference Laboratory, Doherty Institute, Australia -ON637938 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON637939 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON644344 institution Genomics & Epigenomics, AREA Science Park, Italy -ON645312 institution Centre for Clinical Infection & Diagnostics Research, Kings College London, St Thomas Hospital, UK -ON649708 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649709 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649710 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649711 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649712 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649713 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649714 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649715 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649716 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649717 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649718 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649719 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649720 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649721 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649722 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649723 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649724 institution Institute National de Saude Doutor Ricardo Jorge (INSA), 
Portugal -ON649725 institution Institute National de Saude Doutor Ricardo Jorge (INSA), Portugal -ON649879 institution Biochemistry & Molecular Genetics, Israel Institute for Biological Research, Israel -ON674051 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON675438 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON676703 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON676704 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON676705 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON676706 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON676707 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON676708 institution Division of High-Consequence Pathogens & Pathology, Centers for Disease Control & Prevention (US CDC), USA -ON682263 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON682264 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON682265 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON682266 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON682267 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON682268 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON682269 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch 
Institute, Germany -ON682270 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694329 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694330 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694331 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694332 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694333 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694334 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694335 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694336 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694337 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694338 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694339 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694340 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694341 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON694342 institution Centre for Biological Threats, Highly Pathogenic Viruses, Robert Koch Institute, Germany -ON720848 institution Microbial Genomics, Hospital General Universitario Gregorio Marañón, Madrid, Spain -ON720849 institution Microbial Genomics, Hospital General Universitario Gregorio Marañón, Madrid, Spain + diff --git a/ingest/source-data/geolocation-rules.tsv b/ingest/source-data/geolocation-rules.tsv index 
2fc03f8..8b13789 100644 --- a/ingest/source-data/geolocation-rules.tsv +++ b/ingest/source-data/geolocation-rules.tsv @@ -1,16 +1 @@ -Africa/Cote d'Ivoire/*/* Africa/Côte d'Ivoire/*/* -Africa/Cote d'Ivoire/Tai National Park/* Africa/Côte d'Ivoire/Bas-Sassandra/Tai National Park -Africa/Democratic Republic of the Congo/Province Bandundu/* Africa/Democratic Republic of the Congo/Bandundu/* -Africa/Democratic Republic of the Congo/Province Equateur/* Africa/Democratic Republic of the Congo/Équateur/* -Africa/Democratic Republic of the Congo/Province Kasai Occidental/* Africa/Democratic Republic of the Congo/Kasaï-Occidental/* -Africa/Democratic Republic of the Congo/Province Kasai Oriental/* Africa/Democratic Republic of the Congo/Kasaï-Oriental/* -Africa/Democratic Republic of the Congo/Province P. Oriental/* Africa/Democratic Republic of the Congo/Orientale/* -Africa/Democratic Republic of the Congo/Yangdongi/ Africa/Democratic Republic of the Congo/Mongala/Yangdongi -Africa/Democratic Republic of the Congo/Zaire/* Africa/Democratic Republic of the Congo// -Africa/Zaire/*/* Africa/Democratic Republic of the Congo// -*/Zaire/*/* Africa/Democratic Republic of the Congo// -Europe/France/Paris/* Europe/France/Ile de France/Paris FR -Europe/Italy/Fvg/Gorizia Europe/Italy/Friuli Venezia Giulia/Gorizia -# Unclear which location is the real location -Europe/Netherlands/Utrecht/Rotterdam Europe/Netherlands// -North America/USA/Washington/Dc North America/USA/Washington DC/ + From 3dc55b971e9864cdbf9e69424ffba9963deebcaa Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 13:47:35 -0800 Subject: [PATCH 06/28] NCBI Dataset field name transformations MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Organize field renaming into two parts. 1. Rename the NCBI output columns to match the NCBI mnemonics¹ (see "ncbi_datasets_fields:" in `config/config.yaml`) 2. 
Where necessary, rename the NCBI mnemonics to match Nextstrain expected
column names² (see "transform: field_map:" in `config/config.yaml`)

¹ https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
² https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html
---
 ingest/bin/reverse_reversed_sequences.py      | 29 ----------
 ingest/config/config.yaml                     | 58 +++++++++++++++----
 ingest/source-data/ncbi-dataset-field-map.tsv | 17 ------
 .../snakemake_rules/fetch_sequences.smk       | 49 +++-------------
 4 files changed, 55 insertions(+), 98 deletions(-)
 delete mode 100644 ingest/bin/reverse_reversed_sequences.py
 delete mode 100644 ingest/source-data/ncbi-dataset-field-map.tsv

diff --git a/ingest/bin/reverse_reversed_sequences.py b/ingest/bin/reverse_reversed_sequences.py
deleted file mode 100644
index 6ca5ed2..0000000
--- a/ingest/bin/reverse_reversed_sequences.py
+++ /dev/null
@@ -1,29 +0,0 @@
-import pandas as pd
-import argparse
-from Bio import SeqIO
-
-if __name__=="__main__":
-    parser = argparse.ArgumentParser(
-        description="Reverse-complement reverse-complemented sequence",
-        formatter_class=argparse.ArgumentDefaultsHelpFormatter
-    )
-
-    parser.add_argument('--metadata', type=str, required=True, help="input metadata")
-    parser.add_argument('--sequences', type=str, required=True, help="input sequences")
-    parser.add_argument('--output', type=str, required=True, help="output sequences")
-    args = parser.parse_args()
-
-    metadata = pd.read_csv(args.metadata, sep='\t')
-
-    # Read in fasta file
-    with open(args.sequences, 'r') as f_in:
-        with open(args.output, 'w') as f_out:
-            for seq in SeqIO.parse(f_in, 'fasta'):
-                # Check if metadata['reverse'] is True
-                if metadata.loc[metadata['accession'] == seq.id, 'reverse'].values[0] == True:
-                    # Reverse-complement sequence
-                    seq.seq = seq.seq.reverse_complement()
-                    print("Reverse-complementing sequence:", seq.id)
-
-                # Write sequences to file
-
SeqIO.write(seq, f_out, 'fasta')
diff --git a/ingest/config/config.yaml b/ingest/config/config.yaml
index 64dbab1..30da477 100644
--- a/ingest/config/config.yaml
+++ b/ingest/config/config.yaml
@@ -2,22 +2,55 @@ sources: ['genbank']
 # Pathogen NCBI Taxonomy ID
 ncbi_taxon_id: '64320'

-# Renames the NCBI dataset headers
-ncbi_field_map: 'source-data/ncbi-dataset-field-map.tsv'
+# The list of NCBI Datasets fields to include from NCBI Datasets output
+# These need to be the mnemonics of the NCBI Datasets fields, see docs for full list of fields
+# https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
+# Note: the "accession" field MUST be provided to match with the sequences
+ncbi_datasets_fields:
+  - accession
+  - sourcedb
+  - sra-accs
+  - isolate-lineage
+  - geo-region
+  - geo-location
+  - isolate-collection-date
+  - release-date
+  - update-date
+  - length
+  - host-name
+  - isolate-lineage-source
+  - biosample-acc
+  - submitter-names
+  - submitter-affiliation
+  - submitter-country

 # Params for the transform rule
 transform:
-  # Fields to rename.
+  # NCBI fields to rename to Nextstrain field names.
   # This is the first step in the pipeline, so any references to field names
   # in the configs below should use the new field names
-  field_map: ['collected=date', 'submitted=date_submitted', 'genbank_accession=accession', 'submitting_organization=institution']
+  field_map: [
+    'accession=genbank_accession',
+    'accession-rev=genbank_accession_rev',
+    'isolate-lineage=strain',
+    'sourcedb=database',
+    'geo-region=region',
+    'geo-location=location',
+    'host-name=host',
+    'isolate-collection-date=date',
+    'release-date=release_date',
+    'update-date=update_date',
+    'sra-accs=sra_accessions',
+    'submitter-names=authors',
+    'submitter-affiliation=institution',
+  ]
   # Standardized strain name regex
   # Currently accepts any characters because we do not have a clear standard for strain names
   strain_regex: '^.+$'
   # Back up strain name field if 'strain' doesn't match regex above
-  strain_backup_fields: ['accession']
+  strain_backup_fields: ['genbank_accession']
   # List of date fields to standardize
-  date_fields: ['date', 'date_submitted']
+  date_fields: ['date', 'release_date', 'update_date']
   # Expected date formats present in date fields
   # These date formats should use directives expected by datetime
   # See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
@@ -47,14 +80,14 @@ transform:
   # User annotations file
   annotations: 'source-data/annotations.tsv'
   # ID field used to merge annotations
-  annotations_id: 'accession'
+  annotations_id: 'genbank_accession'
   # Field to use as the sequence ID in the FASTA file
-  id_field: 'accession'
+  id_field: 'genbank_accession'
   # Field to use as the sequence in the FASTA file
   sequence_field: 'sequence'
   # Final output columns for the metadata TSV
   metadata_columns: [
-    'accession',
+    'genbank_accession',
     'genbank_accession_rev',
     'strain',
     'date',
@@ -62,11 +95,12 @@ transform:
     'country',
     'division',
     'location',
+    'length',
     'host',
-    'date_submitted',
-    'sra_accession',
+    'release_date',
+    'update_date',
+
'sra_accessions',
     'abbr_authors',
-    'reverse',
     'authors',
     'institution'
   ]

diff --git a/ingest/source-data/ncbi-dataset-field-map.tsv b/ingest/source-data/ncbi-dataset-field-map.tsv
deleted file mode 100644
index eb79418..0000000
--- a/ingest/source-data/ncbi-dataset-field-map.tsv
+++ /dev/null
@@ -1,17 +0,0 @@
-key	value
-Accession	genbank_accession_rev
-Source database	database
-Isolate Lineage	strain
-Geographic Region	region
-Geographic Location	location
-Isolate Collection date	collected
-Release date	submitted
-Update date	updated
-Length	length
-Host Name	host
-Isolate Lineage source	isolation_source
-BioProjects	bioproject_accession
-BioSample accession	biosample_accession
-SRA Accessions	sra_accession
-Submitter Names	authors
-Submitter Affiliation	submitting_organization

diff --git a/ingest/workflow/snakemake_rules/fetch_sequences.smk b/ingest/workflow/snakemake_rules/fetch_sequences.smk
index 3f32f9b..2fef4b1 100644
--- a/ingest/workflow/snakemake_rules/fetch_sequences.smk
+++ b/ingest/workflow/snakemake_rules/fetch_sequences.smk
@@ -44,57 +44,26 @@ rule extract_ncbi_dataset_sequences:
         """

-def _get_ncbi_dataset_field_mnemonics(wildcards) -> str:
-    """
-    Return list of NCBI Dataset report field mnemonics for fields that we want
-    to parse out of the dataset report. The column names in the output TSV
-    are different from the mnemonics.
-
-    See NCBI Dataset docs for full list of available fields and their column
-    names in the output:
-    https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
-    """
-    fields = [
-        "accession",
-        "sourcedb",
-        "isolate-lineage",
-        "geo-region",
-        "geo-location",
-        "isolate-collection-date",
-        "release-date",
-        "update-date",
-        "length",
-        "host-name",
-        "isolate-lineage-source",
-        "bioprojects",
-        "biosample-acc",
-        "sra-accs",
-        "submitter-names",
-        "submitter-affiliation",
-    ]
-    return ",".join(fields)
-
-
 rule format_ncbi_dataset_report:
-    # Formats the headers to be the same as before we used NCBI Datasets
-    # The only fields we do not have equivalents for are "title" and "publications"
+    # Formats the headers to match the NCBI mnemonic names
     input:
         dataset_package="data/ncbi_dataset.zip",
-        ncbi_field_map=config["ncbi_field_map"],
     output:
         ncbi_dataset_tsv=temp("data/ncbi_dataset_report.tsv"),
     params:
-        fields_to_include=_get_ncbi_dataset_field_mnemonics,
+        ncbi_datasets_fields=",".join(config["ncbi_datasets_fields"]),
     benchmark:
         "benchmarks/format_ncbi_dataset_report.txt"
     shell:
         """
         dataformat tsv virus-genome \
             --package {input.dataset_package} \
-            --fields {params.fields_to_include:q} \
-            | csvtk -tl rename2 -F -f '*' -p '(.+)' -r '{{kv}}' -k {input.ncbi_field_map} \
-            | csvtk -tl mutate -f genbank_accession_rev -n genbank_accession -p "^(.+?)\." \
-            | tsv-select -H -f genbank_accession --rest last \
+            --fields {params.ncbi_datasets_fields:q} \
+            --elide-header \
+            | csvtk add-header -t -n {params.ncbi_datasets_fields:q} \
+            | csvtk rename -t -f accession -n accession-rev \
+            | csvtk -tl mutate -f accession-rev -n accession -p "^(.+?)\."
\
+            | tsv-select -H -f accession --rest last \
             > {output.ncbi_dataset_tsv}
         """

@@ -114,7 +83,7 @@ rule format_ncbi_datasets_ndjson:
         augur curate passthru \
             --metadata {input.ncbi_dataset_tsv} \
             --fasta {input.ncbi_dataset_sequences} \
-            --seq-id-column genbank_accession_rev \
+            --seq-id-column accession-rev \
             --seq-field sequence \
             --unmatched-reporting warn \
             --duplicate-reporting warn \

From 17e3912d4b4918944fbd80722f6a9f4cbcee5e1f Mon Sep 17 00:00:00 2001
From: Jennifer Chang
Date: Mon, 13 Nov 2023 15:06:59 -0800
Subject: [PATCH 07/28] Rescue fauna data processing steps that are specific to Zika

Rescue some of the original functionality of the zika_upload script from
fauna.

https://github.com/nextstrain/fauna/blob/master/vdb/zika_upload.py#L14-L30
---
 ingest/bin/post_process_metadata.py           |  57 +++++
 ingest/config/config.yaml                     |   3 +-
 ingest/source-data/annotations.tsv            | 238 +++++++++++++++++-
 ingest/workflow/snakemake_rules/transform.smk |   2 +
 4 files changed, 297 insertions(+), 3 deletions(-)
 create mode 100755 ingest/bin/post_process_metadata.py

diff --git a/ingest/bin/post_process_metadata.py b/ingest/bin/post_process_metadata.py
new file mode 100755
index 0000000..3c587e5
--- /dev/null
+++ b/ingest/bin/post_process_metadata.py
@@ -0,0 +1,57 @@
+#! /usr/bin/env python3
+
+import argparse
+import json
+import re
+from sys import stdin, stdout
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="Reformat an NCBI Virus metadata.tsv file for a pathogen build."
+    )
+    parser.add_argument("--accession-field", default='accession',
+        help="Field from the records to use as the sequence ID in the FASTA file.")
+
+    return parser.parse_args()
+
+
+def _set_strain_name(record):
+    """Normalize the strain name: strip virus/host/source tokens and replace spaces, dashes, and periods with underscores."""
+    strain_name = record["strain"]
+
+    strain_name = strain_name.replace('Zika_virus', '').replace('Zikavirus', '').replace('Zika virus', '').replace('Zika', '').replace('ZIKV', '')
+    strain_name = strain_name.replace('Human', '').replace('human', '').replace('H.sapiens_wt', '').replace('H.sapiens-wt', '').replace('H.sapiens_tc', '').replace('Hsapiens_tc', '').replace('H.sapiens-tc', '').replace('Homo_sapiens', '').replace('Homo sapiens', '').replace('Hsapiens', '').replace('H.sapiens', '')
+    strain_name = strain_name.replace('/Hu/', '')
+    strain_name = strain_name.replace('_Asian', '').replace('_Asia', '').replace('_asian', '').replace('_asia', '')
+    strain_name = strain_name.replace('_URI', '').replace('-URI', '').replace('_SER', '').replace('-SER', '').replace('_PLA', '').replace('-PLA', '').replace('_MOS', '').replace('_SAL', '')
+    strain_name = strain_name.replace('Aaegypti_wt', 'Aedes_aegypti').replace('Aedessp', 'Aedes_sp')
+    strain_name = strain_name.replace(' ', '').replace('\'', '').replace('(', '').replace(')', '').replace('//', '/').replace('__', '_').replace('.', '').replace(',', '')
+    strain_name = re.sub(r'^[/_-]', '', strain_name)
+
+    try:
+        strain_name = 'V' + str(int(strain_name))
+    except ValueError:
+        pass
+
+    return (
+        strain_name.replace(" ", "_")
+        .replace("-", "_")
+        .replace(".", "_")
+        .replace("(", "_")
+        .replace(")", "_")
+    )
+
+
+def main():
+    args = parse_args()
+
+    for record in stdin:
+        record = json.loads(record)
+        record["strain"] = _set_strain_name(record)
+        record["authors"] = record["abbr_authors"]
+        stdout.write(json.dumps(record) + "\n")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/ingest/config/config.yaml
b/ingest/config/config.yaml index 30da477..927bd75 100644 --- a/ingest/config/config.yaml +++ b/ingest/config/config.yaml @@ -100,8 +100,7 @@ transform: 'release_date', 'update_date', 'sra_accessions', - 'abbr_authors', 'authors', - 'institution' + 'institution', ] diff --git a/ingest/source-data/annotations.tsv b/ingest/source-data/annotations.tsv index 8b13789..502a3c1 100644 --- a/ingest/source-data/annotations.tsv +++ b/ingest/source-data/annotations.tsv @@ -1 +1,237 @@ - +KX922703 strain USA/2016/FL021 +KY765326 strain NIC/6188_13A1/2016 +KX922707 strain USA/2016/FL039 +KU922923 strain MEX/InDRE/2016 +KY075934 strain PuertoRico/2016/FL016U +KY765327 strain NIC/5005_13A1/2016 +KX922705 strain USA/2016/FL032 +KY075938 strain Aedes_aegypti/USA/2016/FL06 +KX922704 strain USA/2016/FL030 +KX673530 strain PHE_Guadeloupe +KY075935 strain USA/2016/FL022 +KX838906 strain Aedes_aegypti/USA/2016/FL03 +KY075933 strain PuertoRico/2016/FL008U +KX838904 strain Aedes_aegypti/USA/2016/FL01 +KX838905 strain Aedes_aegypti/USA/2016/FL02 +KY765320 strain NIC/6406_13A1/2016 +KY075936 strain USA/2016/FL036 +KY075932 strain Martinique/2016/FL001Sa +KY765321 strain NIC/4886_12A1/2016 +KY075939 strain Aedes_aegypti/USA/2016/FL08 +KX922706 strain USA/2016/FL038 +KY075937 strain Aedes_aegypti/USA/2016/FL05 +KX922708 strain Aedes_aegypti/USA/2016/FL04 +KY014295 strain USA/2016/FL010 +MT377503 strain V151144 +MF988734 strain SG_EHI_/33164Y17 +KU853013 strain Dominican_Republic/2016/PD2 +KY785443 strain USA/2016/FL028 +KX906952 strain 2016_HND_19563 +KY120348 strain MEX_CIENI551 +KX856011 strain Aedes_sp/MEX_I_44/2016 +KY785421 strain USA/2016/FL019 +KU527068 strain Natal_RGN +MF438286 strain Cuba_2017 +KF993678 strain THA/PLCal_ZV/2013 +KY631494 strain ENCB165P4 +KY785440 strain USA/2016/FL035 +KY785451 strain Martinique/2016/FL001 +MF664436 strain Dominican_Republic/2016/ZB +KY648934 strain Aedes_aegypti/MEX/MEX_I_44/2016 +KX879603 strain EC/Esmeraldas/062/2016 +OL414716 strain Faranah/18 
+MN185326 strain French_Guiana_Aedes_aegypti_T1010 +MN185328 strain French_Guiana_Aedes_aegypti_T1141 +KX827268 strain USA/UT_1/2016 +KU853012 strain Dominican_Republic/2016/PD1 +MK028857 strain Puerto_Rico/2015/PRVABC59 +KY785457 strain USA/2016/FL029 +MH513600 strain BR/Sinop/H366_2P/2015 +KY927808 strain ZZ_1 +KX087102 strain COL/FLR/2015 +KX879604 strain EC/Esmeraldas/089/2016 +KF993678 country Thailand +KF993678 division Thailand +KF993678 location Thailand +KF993678 region Southeast Asia +KU647676 country Martinique +KU647676 division Martinique +KU647676 location Martinique +KU647676 region North America +KU740184 country Venezuela +KU740184 division Venezuela +KU740184 location Venezuela +KU740184 region South America +KU744693 country Venezuela +KU744693 division Venezuela +KU744693 location Venezuela +KU744693 region South America +KU758877 country French Guiana +KU758877 division French Guiana +KU758877 location French Guiana +KU758877 region South America +KU761560 country American Samoa +KU761560 division American Samoa +KU761560 location American Samoa +KU761560 region Oceania +KU761561 country American Samoa +KU761561 division American Samoa +KU761561 location American Samoa +KU761561 region Oceania +KU761564 country Venezuela +KU761564 division Venezuela +KU761564 location Venezuela +KU761564 region South America +KU820898 country Venezuela +KU820898 division Venezuela +KU820898 location Venezuela +KU820898 region South America +KU853012 country Dominican Republic +KU853012 division Dominican Republic +KU853012 location Dominican Republic +KU853012 region North America +KU866423 country American Samoa +KU866423 division American Samoa +KU866423 location American Samoa +KU866423 region Oceania +KU955589 country American Samoa +KU955589 division American Samoa +KU955589 location American Samoa +KU955589 region Oceania +KU955590 country Venezuela +KU955590 division Venezuela +KU955590 location Venezuela +KU955590 region South America +KU963796 country 
American Samoa +KU963796 division American Samoa +KU963796 location American Samoa +KU963796 region Oceania +KU991811 country Brazil +KU991811 division Brazil +KU991811 location Brazil +KU991811 region South America +KX056898 country Venezuela +KX056898 division Venezuela +KX056898 location Venezuela +KX056898 region South America +KX117076 country American Samoa +KX117076 division American Samoa +KX117076 location American Samoa +KX117076 region Oceania +KX185891 country American Samoa +KX185891 division American Samoa +KX185891 location American Samoa +KX185891 region Oceania +KX253996 country American Samoa +KX253996 division American Samoa +KX253996 location American Samoa +KX253996 region Oceania +KX266255 country American Samoa +KX266255 division American Samoa +KX266255 location American Samoa +KX266255 region Oceania +KX269878 country Haiti +KX269878 division Haiti +KX269878 location Haiti +KX269878 region North America +KX673530 country Guadeloupe +KX673530 division Guadeloupe +KX673530 location Guadeloupe +KX673530 region North America +KY120352 country Brazil +KY120352 division Brazil +KY120352 location Brazil +KY120352 region South America +KY120353 country Philippines +KY120353 division Philippines +KY120353 location Philippines +KY120353 region Southeast Asia +KY553111 country Philippines +KY553111 division Philippines +KY553111 location Philippines +KY553111 region Southeast Asia +KY785451 country Martinique +KY785451 division Martinique +KY785451 location Martinique +KY785451 region North America +KY785454 country El Salvador +KY785454 division El Salvador +KY785454 location El Salvador +KY785454 region North America +KY962729 country Philippines +KY962729 division Philippines +KY962729 location Philippines +KY962729 region Southeast Asia +LC191864 country Fiji +LC191864 division Fiji +LC191864 location Fiji +LC191864 region Oceania +LC219720 country Vietnam +LC219720 division Vietnam +LC219720 location Vietnam +LC219720 region Southeast Asia 
+LC369584 country Thailand +LC369584 division Thailand +LC369584 location Thailand +LC369584 region Southeast Asia +MF098764 country Dominican Republic +MF098764 division Dominican Republic +MF098764 location Dominican Republic +MF098764 region North America +MF098765 country Dominican Republic +MF098765 division Dominican Republic +MF098765 location Dominican Republic +MF098765 region North America +MF098766 country Dominican Republic +MF098766 division Dominican Republic +MF098766 location Dominican Republic +MF098766 region North America +MF098767 country Saint Barthelemy +MF098767 division Saint Barthelemy +MF098767 location Saint Barthelemy +MF098767 region North America +MF098768 country Dominican Republic +MF098768 division Dominican Republic +MF098768 location Dominican Republic +MF098768 region North America +MF098769 country Dominican Republic +MF098769 division Dominican Republic +MF098769 location Dominican Republic +MF098769 region North America +MF098770 country Mexico +MF098770 division Mexico +MF098770 location Mexico +MF098770 region North America +MF098771 country Mexico +MF098771 division Mexico +MF098771 location Mexico +MF098771 region North America +MF593625 country Guatemala +MF593625 division Guatemala +MF593625 location Guatemala +MF593625 region North America +MF664436 country Dominican Republic +MF664436 division Dominican Republic +MF664436 location Dominican Republic +MF664436 region North America +MF692778 country Thailand +MF692778 division Thailand +MF692778 location Thailand +MF692778 region Southeast Asia +MF988734 country Cuba +MF988734 division Cuba +MF988734 location Cuba +MF988734 region North America +MK829154 country Angola +MK829154 division Angola +MK829154 location Angola +MK829154 region Africa +MN185326 country French Guiana +MN185326 division French Guiana +MN185326 location French Guiana +MN185326 region South America +MN185328 country French Guiana +MN185328 division French Guiana +MN185328 location French Guiana 
+MN185328 region South America +KY328289 date 2016-05-15 \ No newline at end of file diff --git a/ingest/workflow/snakemake_rules/transform.smk b/ingest/workflow/snakemake_rules/transform.smk index ec63d00..a0891e5 100644 --- a/ingest/workflow/snakemake_rules/transform.smk +++ b/ingest/workflow/snakemake_rules/transform.smk @@ -85,6 +85,8 @@ rule transform: --abbr-authors-field {params.abbr_authors_field} \ | ./vendored/apply-geolocation-rules \ --geolocation-rules {input.all_geolocation_rules} \ + | ./bin/post_process_metadata.py \ + --accession-field {params.id_field} \ | ./vendored/merge-user-metadata \ --annotations {input.annotations} \ --id-field {params.annotations_id} \ From 6bdee23adeec358f431415a9a33ffaae0e656482 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 15:19:26 -0800 Subject: [PATCH 08/28] Ignore snakemake state dir for current and subfolders --- .gitignore | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index b653f7d..1793bfc 100644 --- a/.gitignore +++ b/.gitignore @@ -9,7 +9,9 @@ build/ environment* # Snakemake state dir -/.snakemake +.snakemake/ +benchmarks/ +logs/ # Local config overrides /config_local.yaml From 41c902f8c4f044700d771762cddb33e73e65b2ca Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 13 Nov 2023 15:16:14 -0800 Subject: [PATCH 09/28] Use genbank_accession column as ID column --- Snakefile | 44 +++++++- config/config_zika.yaml | 2 + config/dropped_strains.txt | 167 ++++++++++++++++--------------- example_data/metadata.tsv | 70 ++++++------- example_data/sequences.fasta | 68 ++++++------- scripts/set_final_strain_name.py | 38 +++++++ 6 files changed, 233 insertions(+), 156 deletions(-) create mode 100644 config/config_zika.yaml create mode 100644 scripts/set_final_strain_name.py diff --git a/Snakefile b/Snakefile index 3562078..15ff6a9 100644 --- a/Snakefile +++ b/Snakefile @@ -1,3 +1,6 @@ +if not config: + configfile: "config/config_zika.yaml" + rule all: 
input: auspice_json = "auspice/zika.json", @@ -59,12 +62,14 @@ rule filter: group_by = "country year month", sequences_per_group = 40, min_date = 2012, - min_length = 5385 + min_length = 5385, + strain_id = config.get("strain_id_field", "strain"), shell: """ augur filter \ --sequences {input.sequences} \ --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ --exclude {input.exclude} \ --output {output.sequences} \ --group-by {params.group_by} \ @@ -124,13 +129,15 @@ rule refine: params: coalescent = "opt", date_inference = "marginal", - clock_filter_iqd = 4 + clock_filter_iqd = 4, + strain_id = config.get("strain_id_field", "strain"), shell: """ augur refine \ --tree {input.tree} \ --alignment {input.alignment} \ --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ --output-tree {output.tree} \ --output-node-data {output.node_data} \ --timetree \ @@ -187,12 +194,14 @@ rule traits: node_data = "results/traits.json", params: columns = "region country", - sampling_bias_correction = 3 + sampling_bias_correction = 3, + strain_id = config.get("strain_id_field", "strain"), shell: """ augur traits \ --tree {input.tree} \ --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ --output {output.node_data} \ --columns {params.columns} \ --confidence \ @@ -212,12 +221,16 @@ rule export: auspice_config = files.auspice_config, description = files.description output: - auspice_json = rules.all.input.auspice_json + auspice_json = "results/raw_zika.json", + root_sequence = "results/raw_zika_root-sequence.json", + params: + strain_id = config.get("strain_id_field", "strain"), shell: """ augur export v2 \ --tree {input.tree} \ --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ --node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} \ --colors {input.colors} \ --auspice-config {input.auspice_config} \ @@ -226,6 +239,29 @@ rule export: --output {output.auspice_json} """ +rule 
final_strain_name: + input: + auspice_json="results/raw_zika.json", + metadata="data/metadata.tsv", + root_sequence="results/raw_zika_root-sequence.json", + output: + auspice_json="auspice/zika.json", + root_sequence="auspice/zika_root-sequence.json", + params: + strain_id=config["strain_id_field"], + display_strain_field=config.get("display_strain_field", "strain"), + shell: + """ + python3 scripts/set_final_strain_name.py \ + --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ + --input-auspice-json {input.auspice_json} \ + --display-strain-name {params.display_strain_field} \ + --output {output.auspice_json} + + cp {input.root_sequence} {output.root_sequence} + """ + rule clean: """Removing directories: {params}""" params: diff --git a/config/config_zika.yaml b/config/config_zika.yaml new file mode 100644 index 0000000..5345584 --- /dev/null +++ b/config/config_zika.yaml @@ -0,0 +1,2 @@ +strain_id_field: "genbank_accession" +display_strain_field: "strain" \ No newline at end of file diff --git a/config/dropped_strains.txt b/config/dropped_strains.txt index 22b5878..746e1ba 100644 --- a/config/dropped_strains.txt +++ b/config/dropped_strains.txt @@ -1,86 +1,87 @@ -PF13/251013_18 # reference included in config/zika_reference.gb -AFMC_U # too basal -AFMC_S # too basal -Boracay/16423 # too basal -JMB_185 # too basal -PHL/2012/CPC_0740 # too basal +MG827392 +KX369547 # PF13/251013_18 # reference included in config/zika_reference.gb +KY553111 # AFMC_U # too basal +KY962729 # AFMC_S # too basal +KY120353 # Boracay/16423 # too basal +KU179098 # JMB_185 # too basal +KU681082 # PHL/2012/CPC_0740 # too basal VIE/Bra/2016 # too basal -Dominican_Republic/2016/PD2 # duplicate of other strain in dataset -GD01 # duplicate of other strain in dataset -GDZ16001 # duplicate of other strain in dataset -VEN/UF_2/2016 # duplicate of other strain in dataset -ZZ_1 # duplicate of other strain in dataset -VR10599/Pavia/2016 # export with unknown origin 
-34997/Pavia/2016 # export with unknown origin -COL/FLR_00001/2015 # duplicate of COL/FLR/2015 -COL/FLR_00002/2015 # duplicate of COL/FLR/2015 -COL/FLR_00003/2015 # duplicate of COL/FLR/2015 -COL/FLR_00004/2015 # duplicate of COL/FLR/2015 -COL/FLR_00005/2015 # duplicate of COL/FLR/2015 -COL/FLR_00006/2015 # duplicate of COL/FLR/2015 -COL/FLR_00007/2015 # duplicate of COL/FLR/2015 -COL/FLR_00008/2015 # duplicate of COL/FLR/2015 -COL/FLR_00009/2015 # duplicate of COL/FLR/2015 -COL/FLR_00010/2015 # duplicate of COL/FLR/2015 -COL/FLR_00011/2015 # duplicate of COL/FLR/2015 -COL/FLR_00012/2015 # duplicate of COL/FLR/2015 -COL/FLR_00013/2015 # duplicate of COL/FLR/2015 -COL/FLR_00014/2015 # duplicate of COL/FLR/2015 -COL/FLR_00015/2015 # duplicate of COL/FLR/2015 -COL/FLR_00016/2015 # duplicate of COL/FLR/2015 -COL/FLR_00017/2015 # duplicate of COL/FLR/2015 -COL/FLR_00018/2015 # duplicate of COL/FLR/2015 -COL/FLR_00019/2015 # duplicate of COL/FLR/2015 -COL/FLR_00020/2015 # duplicate of COL/FLR/2015 -COL/FLR_00021/2015 # duplicate of COL/FLR/2015 -COL/FLR_00022/2015 # duplicate of COL/FLR/2015 -COL/FLR_00023/2015 # duplicate of COL/FLR/2015 -COL/FLR_00024/2015 # duplicate of COL/FLR/2015 -COL/FLR_00025/2015 # duplicate of COL/FLR/2015 -COL/FLR_00026/2015 # duplicate of COL/FLR/2015 -COL/FLR_00034/2015 # duplicate of COL/FLR/2015 -COL/FLR_00035/2015 # duplicate of COL/FLR/2015 -COL/FLR_00036/2015 # duplicate of COL/FLR/2015 -COL/FLR_00038/2015 # duplicate of COL/FLR/2015 -COL/FLR_00040/2015 # duplicate of COL/FLR/2015 -COL/FLR_00041/2015 # duplicate of COL/FLR/2015 -COL/FLR_00042/2015 # duplicate of COL/FLR/2015 -COL/PRV_00027/2015 # misdated -COL/PRV_00028/2015 # misdated -COL/PAN_00029/2015 # misdated -COL/PAN_00030/2015 # misdated -BRA/2016/FC_DQ12D1 # large indel -Brazil/2016/ZBRX8 # large indel -Brazil/2016/ZBRX11 # large indel -CX17 # large indel -MEX/2016/mex27 # large indel -MEX/2016/mex50 # large indel -SLV/2016/ElSalvador_1055 # large indel -USVI/20/2016 # large 
indel +KU853013 # Dominican_Republic/2016/PD2 # duplicate of other strain in dataset +KU740184 # GD01 # duplicate of other strain in dataset +KU761564 # GDZ16001 # duplicate of other strain in dataset +KX893855 # VEN/UF_2/2016 # duplicate of other strain in dataset +KY927808 # ZZ_1 # duplicate of other strain in dataset +KY003154 # VR10599/Pavia/2016 # export with unknown origin +KY003153 # 34997/Pavia/2016 # export with unknown origin +MF574552 # COL/FLR_00001/2015 # duplicate of COL/FLR/2015 +MF574559 # COL/FLR_00002/2015 # duplicate of COL/FLR/2015 +MF574560 # COL/FLR_00003/2015 # duplicate of COL/FLR/2015 +MF574561 # COL/FLR_00004/2015 # duplicate of COL/FLR/2015 +MF574571 # COL/FLR_00005/2015 # duplicate of COL/FLR/2015 +MF574555 # COL/FLR_00006/2015 # duplicate of COL/FLR/2015 +MF574557 # COL/FLR_00007/2015 # duplicate of COL/FLR/2015 +MF574562 # COL/FLR_00008/2015 # duplicate of COL/FLR/2015 +MF574572 # COL/FLR_00009/2015 # duplicate of COL/FLR/2015 +MF574570 # COL/FLR_00010/2015 # duplicate of COL/FLR/2015 +MF574565 # COL/FLR_00011/2015 # duplicate of COL/FLR/2015 +MF574568 # COL/FLR_00012/2015 # duplicate of COL/FLR/2015 +MF574558 # COL/FLR_00013/2015 # duplicate of COL/FLR/2015 +MF574576 # COL/FLR_00014/2015 # duplicate of COL/FLR/2015 +MF574567 # COL/FLR_00015/2015 # duplicate of COL/FLR/2015 +MF574575 # COL/FLR_00016/2015 # duplicate of COL/FLR/2015 +MF574553 # COL/FLR_00017/2015 # duplicate of COL/FLR/2015 +MF574573 # COL/FLR_00018/2015 # duplicate of COL/FLR/2015 +MF574574 # COL/FLR_00019/2015 # duplicate of COL/FLR/2015 +MF574577 # COL/FLR_00020/2015 # duplicate of COL/FLR/2015 +MF574556 # COL/FLR_00021/2015 # duplicate of COL/FLR/2015 +MF574554 # COL/FLR_00022/2015 # duplicate of COL/FLR/2015 +MF574566 # COL/FLR_00023/2015 # duplicate of COL/FLR/2015 +MF574569 # COL/FLR_00024/2015 # duplicate of COL/FLR/2015 +MF574563 # COL/FLR_00025/2015 # duplicate of COL/FLR/2015 +MF574564 # COL/FLR_00026/2015 # duplicate of COL/FLR/2015 +MF574581 # 
COL/FLR_00034/2015 # duplicate of COL/FLR/2015 +MF574588 # COL/FLR_00035/2015 # duplicate of COL/FLR/2015 +MF574582 # COL/FLR_00036/2015 # duplicate of COL/FLR/2015 +MF574586 # COL/FLR_00038/2015 # duplicate of COL/FLR/2015 +MF574584 # COL/FLR_00040/2015 # duplicate of COL/FLR/2015 +MF574583 # COL/FLR_00041/2015 # duplicate of COL/FLR/2015 +MF574580 # COL/FLR_00042/2015 # duplicate of COL/FLR/2015 +MF574579 # COL/PRV_00027/2015 # misdated +MF574578 # COL/PRV_00028/2015 # misdated +MF574585 # COL/PAN_00029/2015 # misdated +MF574587 # COL/PAN_00030/2015 # misdated +KY785436 # BRA/2016/FC_DQ12D1 # large indel +KY559010 # Brazil/2016/ZBRX8 # large indel +KY559011 # Brazil/2016/ZBRX11 # large indel +KX986761 # CX17 # large indel +MF801405 # MEX/2016/mex27 # large indel +MF801424 # MEX/2016/mex50 # large indel +MF801377 # SLV/2016/ElSalvador_1055 # large indel +VI20_12plex # USVI/20/2016 # large indel USVI/21/2016 # large indel -USVI/23/2016 # large indel -USVI/27/2016 # large indel -USVI/30/2016 # large indel -USVI/32/2016 # large indel -Thailand/1605aTw # excess divergence -VE_Ganxian # excess divergence -ZK_YN001 # excess divergence -Haiti/0029/2014 # contamination present -Haiti/0033/2014 # contamination present -Haiti/0036/2014 # contamination present -Haiti/0054/2014 # contamination present -Haiti/0074/2014 # contamination present -Haiti/0097/2014 # contamination present -mosquito/Haiti/1682/2016 # contamination present +VI23_12plex # USVI/23/2016 # large indel +VI27_1d # USVI/27/2016 # large indel +VI30_1d # USVI/30/2016 # large indel +VI32_12plex # USVI/32/2016 # large indel +KY126351 # Thailand/1605aTw # excess divergence +KU744693 # VE_Ganxian # excess divergence +KY328290 # ZK_YN001 # excess divergence +KY415986 # Haiti/0029/2014 # contamination present +KY415987 # Haiti/0033/2014 # contamination present +KY415990 # Haiti/0036/2014 # contamination present +KY415988 # Haiti/0054/2014 # contamination present +KY415989 # Haiti/0074/2014 # contamination present 
+KY415991 # Haiti/0097/2014 # contamination present +MF384325 # mosquito/Haiti/1682/2016 # contamination present ZF36_36S # contamination present -MR766 # lab strain -Aedes_sp/MEX_I_44/2016 # duplicate of Aedes_aegypti/MEX/MEX_I_44/2016 -Puerto_Rico/2015/PRVABC59 # duplicate of PRVABC59 -V15555 # highly diverged -DK # lab strain -DK23 # lab strain -rGZ02a/2018 # highly diverged -rGZ02p/2018 # highly diverged -V211784 # highly diverged -LMM/AG5643 -Faranah/18 +MK105975 # MR766 # lab strain +KX856011 # Aedes_sp/MEX_I_44/2016 # duplicate of Aedes_aegypti/MEX/MEX_I_44/2016 +MK028857 # Puerto_Rico/2015/PRVABC59 # duplicate of PRVABC59 +MN025403 # V15555 # highly diverged +MT505349 # DK # lab strain +MT505350 # DK23 # lab strain +MW680969 # rGZ02a/2018 # highly diverged +MW680970 # rGZ02p/2018 # highly diverged +OK054351 # V211784 # highly diverged +MT478034 # LMM/AG5643 +OL414716 # Faranah/18 diff --git a/example_data/metadata.tsv b/example_data/metadata.tsv index 9c30f2e..6e5345c 100644 --- a/example_data/metadata.tsv +++ b/example_data/metadata.tsv @@ -1,35 +1,35 @@ -strain virus accession date region country division city db segment authors url title journal paper_url -PAN/CDC_259359_V1_V3/2015 zika KX156774 2015-12-18 North America Panama Panama Panama genbank genome Shabman et al https://www.ncbi.nlm.nih.gov/nuccore/KX156774 Direct Submission Submitted (29-APR-2016) J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA https://www.ncbi.nlm.nih.gov/pubmed/ -COL/FLR_00024/2015 zika MF574569 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574569 Direct Submission Submitted (28-JUL-2017) J. 
Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA https://www.ncbi.nlm.nih.gov/pubmed/ -PRVABC59 zika KU501215 2015-12-XX North America Puerto Rico Puerto Rico Puerto Rico genbank genome Lanciotti et al https://www.ncbi.nlm.nih.gov/nuccore/KU501215 Phylogeny of Zika Virus in Western Hemisphere, 2015 Emerging Infect. Dis. 22 (5), 933-935 (2016) https://www.ncbi.nlm.nih.gov/pubmed/27088323 -COL/FLR_00008/2015 zika MF574562 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574562 Direct Submission Submitted (28-JUL-2017) J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA https://www.ncbi.nlm.nih.gov/pubmed/ -Colombia/2016/ZC204Se zika KY317939 2016-01-06 South America Colombia Colombia Colombia genbank genome Quick et al https://www.ncbi.nlm.nih.gov/nuccore/KY317939 Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples Nat Protoc 12 (6), 1261-1276 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538739 -ZKC2/2016 zika KX253996 2016-02-16 Oceania American Samoa American Samoa American Samoa genbank genome Wu et al https://www.ncbi.nlm.nih.gov/nuccore/KX253996 Direct Submission Submitted (18-MAY-2016) Center for Diseases Control and Prevention of Guangdong Province; National Institute of Viral Disease Control and Prevention, China https://www.ncbi.nlm.nih.gov/pubmed/ -VEN/UF_1/2016 zika KX702400 2016-03-25 South America Venezuela Venezuela Venezuela genbank genome Blohm et al https://www.ncbi.nlm.nih.gov/nuccore/KX702400 Complete Genome Sequences of Identical Zika virus Isolates in a Nursing Mother and Her Infant Genome Announc 5 (17), e00231-17 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28450510 -DOM/2016/BB_0059 zika KY785425 2016-04-04 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785425 Zika 
virus evolution and spread in the Americas Nature 546 (7658), 411-415 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538734 -BRA/2016/FC_6706 zika KY785433 2016-04-08 South America Brazil Brazil Brazil genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785433 Zika virus evolution and spread in the Americas Nature 546 (7658), 411-415 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538734 -DOM/2016/BB_0183 zika KY785420 2016-04-18 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785420 Zika virus evolution and spread in the Americas Nature 546 (7658), 411-415 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538734 -EcEs062_16 zika KX879603 2016-04-XX South America Ecuador Ecuador Ecuador genbank genome Marquez et al https://www.ncbi.nlm.nih.gov/nuccore/KX879603 First Complete Genome Sequences of Zika Virus Isolated from Febrile Patient Sera in Ecuador Genome Announc 5 (8), e01673-16 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28232448 -HND/2016/HU_ME59 zika KY785418 2016-05-13 North America Honduras Honduras Honduras genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785418 Zika virus evolution and spread in the Americas Nature 546 (7658), 411-415 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538734 -DOM/2016/MA_WGS16_011 zika KY785484 2016-06-06 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785484 Zika virus evolution and spread in the Americas Nature 546 (7658), 411-415 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538734 -DOM/2016/BB_0433 zika KY785441 2016-06-13 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785441 Zika virus evolution and spread in the Americas Nature 546 (7658), 411-415 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538734 -USA/2016/FL022 zika KY075935 
2016-07-22 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY075935 Genomic epidemiology reveals multiple introductions of Zika virus into the United States Nature (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/28538723 -SG_027 zika KY241697 2016-08-27 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241697 Outbreak of Zika virus infection in Singapore: an epidemiological, entomological, virological, and clinical analysis Lancet Infect Dis (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/ -SG_074 zika KY241744 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241744 Outbreak of Zika virus infection in Singapore: an epidemiological, entomological, virological, and clinical analysis Lancet Infect Dis (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/ -SG_056 zika KY241726 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241726 Outbreak of Zika virus infection in Singapore: an epidemiological, entomological, virological, and clinical analysis Lancet Infect Dis (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/ -USA/2016/FLUR022 zika KY325473 2016-08-31 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY325473 Genomic epidemiology reveals multiple introductions of Zika virus into the United States Nature (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/28538723 -Aedes_aegypti/USA/2016/FL05 zika KY075937 2016-09-09 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY075937 Genomic epidemiology reveals multiple introductions of Zika virus into the United States Nature (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/28538723 -SG_018 zika KY241688 2016-09-13 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al 
https://www.ncbi.nlm.nih.gov/nuccore/KY241688 Outbreak of Zika virus infection in Singapore: an epidemiological, entomological, virological, and clinical analysis Lancet Infect Dis (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/ -USA/2016/FLWB042 zika KY325478 2016-09-26 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY325478 Genomic epidemiology reveals multiple introductions of Zika virus into the United States Nature (2017) In press https://www.ncbi.nlm.nih.gov/pubmed/28538723 -COL/PRV_00028/2015 zika MF574578 2016-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574578 Direct Submission Submitted (30-JUL-2017) J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA https://www.ncbi.nlm.nih.gov/pubmed/ -Thailand/1610acTw zika MF692778 2016-10-XX Southeast Asia Thailand Thailand Thailand genbank genome Lin et al https://www.ncbi.nlm.nih.gov/nuccore/MF692778 Imported Zika virus strains, Taiwan, 2016 Unpublished https://www.ncbi.nlm.nih.gov/pubmed/ -1_0087_PF zika KX447509 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447509 How Did Zika Virus Emerge in the Pacific Islands and Latin America? MBio 7 (5), e01239-16 (2016) https://www.ncbi.nlm.nih.gov/pubmed/27729507 -1_0199_PF zika KX447519 2013-11-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447519 How Did Zika Virus Emerge in the Pacific Islands and Latin America? MBio 7 (5), e01239-16 (2016) https://www.ncbi.nlm.nih.gov/pubmed/27729507 -1_0181_PF zika KX447512 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447512 How Did Zika Virus Emerge in the Pacific Islands and Latin America? 
MBio 7 (5), e01239-16 (2016) https://www.ncbi.nlm.nih.gov/pubmed/27729507 -Brazil/2015/ZBRC301 zika KY558995 2015-05-13 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558995 Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas Unpublished https://www.ncbi.nlm.nih.gov/pubmed/ -Brazil/2015/ZBRA105 zika KY558989 2015-02-23 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558989 Establishment and cryptic transmission of Zika virus in Brazil and the Americas Nature 546 (7658), 406-410 (2017) https://www.ncbi.nlm.nih.gov/pubmed/28538727 -Brazil/2016/ZBRC16 zika KY558991 2016-01-19 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558991 Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas Unpublished https://www.ncbi.nlm.nih.gov/pubmed/ -V8375 zika KU501217 2015-11-01 North America Guatemala Guatemala Guatemala genbank genome Lanciotti et al https://www.ncbi.nlm.nih.gov/nuccore/KU501217 Phylogeny of Zika Virus in Western Hemisphere, 2015 Emerging Infect. Dis. 
22 (5), 933-935 (2016) https://www.ncbi.nlm.nih.gov/pubmed/27088323 -Nica1_16 zika KX421195 2016-01-19 North America Nicaragua Nicaragua Nicaragua genbank genome Tabata et al https://www.ncbi.nlm.nih.gov/nuccore/KX421195 Zika Virus Targets Different Primary Human Placental Cells, Suggesting Two Routes for Vertical Transmission Cell Host Microbe 20 (2), 155-166 (2016) https://www.ncbi.nlm.nih.gov/pubmed/27443522 -Brazil/2015/ZBRC303 zika KY558997 2015-05-14 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558997 Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas Unpublished https://www.ncbi.nlm.nih.gov/pubmed/ -SMGC_1 zika KX266255 2016-02-14 Oceania American Samoa American Samoa American Samoa genbank genome Bi et al https://www.ncbi.nlm.nih.gov/nuccore/KX266255 Genetic and Biological Characterization for Zika Viruses Imported through Shenzhen Port Chin. Sci. Bull. 61 (22), 2463-2474 (2016) https://www.ncbi.nlm.nih.gov/pubmed/ +strain virus genbank_accession date region country division city db segment authors url +PAN/CDC_259359_V1_V3/2015 zika KX156774 2015-12-18 North America Panama Panama Panama genbank genome Shabman et al https://www.ncbi.nlm.nih.gov/nuccore/KX156774 +COL/FLR_00024/2015 zika MF574569 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574569 +PRVABC59 zika KU501215 2015-12-XX North America Puerto Rico Puerto Rico Puerto Rico genbank genome Lanciotti et al https://www.ncbi.nlm.nih.gov/nuccore/KU501215 +COL/FLR_00008/2015 zika MF574562 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574562 +Colombia/2016/ZC204Se zika KY317939 2016-01-06 South America Colombia Colombia Colombia genbank genome Quick et al https://www.ncbi.nlm.nih.gov/nuccore/KY317939 +ZKC2/2016 zika KX253996 2016-02-16 Oceania American Samoa American 
Samoa American Samoa genbank genome Wu et al https://www.ncbi.nlm.nih.gov/nuccore/KX253996 +VEN/UF_1/2016 zika KX702400 2016-03-25 South America Venezuela Venezuela Venezuela genbank genome Blohm et al https://www.ncbi.nlm.nih.gov/nuccore/KX702400 +DOM/2016/BB_0059 zika KY785425 2016-04-04 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785425 +BRA/2016/FC_6706 zika KY785433 2016-04-08 South America Brazil Brazil Brazil genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785433 +DOM/2016/BB_0183 zika KY785420 2016-04-18 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785420 +EcEs062_16 zika KX879603 2016-04-XX South America Ecuador Ecuador Ecuador genbank genome Marquez et al https://www.ncbi.nlm.nih.gov/nuccore/KX879603 +HND/2016/HU_ME59 zika KY785418 2016-05-13 North America Honduras Honduras Honduras genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785418 +DOM/2016/MA_WGS16_011 zika KY785484 2016-06-06 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785484 +DOM/2016/BB_0433 zika KY785441 2016-06-13 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785441 +USA/2016/FL022 zika KY075935 2016-07-22 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY075935 +SG_027 zika KY241697 2016-08-27 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241697 +SG_074 zika KY241744 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241744 +SG_056 zika KY241726 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho 
et al https://www.ncbi.nlm.nih.gov/nuccore/KY241726 +USA/2016/FLUR022 zika KY325473 2016-08-31 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY325473 +Aedes_aegypti/USA/2016/FL05 zika KY075937 2016-09-09 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY075937 +SG_018 zika KY241688 2016-09-13 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241688 +USA/2016/FLWB042 zika KY325478 2016-09-26 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY325478 +COL/PRV_00028/2015 zika MF574578 2016-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574578 +Thailand/1610acTw zika MF692778 2016-10-XX Southeast Asia Thailand Thailand Thailand genbank genome Lin et al https://www.ncbi.nlm.nih.gov/nuccore/MF692778 +1_0087_PF zika KX447509 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447509 +1_0199_PF zika KX447519 2013-11-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447519 +1_0181_PF zika KX447512 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447512 +Brazil/2015/ZBRC301 zika KY558995 2015-05-13 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558995 +Brazil/2015/ZBRA105 zika KY558989 2015-02-23 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558989 +Brazil/2016/ZBRC16 zika KY558991 2016-01-19 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558991 +V8375 zika KU501217 2015-11-01 North America Guatemala 
Guatemala Guatemala genbank genome Lanciotti et al https://www.ncbi.nlm.nih.gov/nuccore/KU501217 +Nica1_16 zika KX421195 2016-01-19 North America Nicaragua Nicaragua Nicaragua genbank genome Tabata et al https://www.ncbi.nlm.nih.gov/nuccore/KX421195 +Brazil/2015/ZBRC303 zika KY558997 2015-05-14 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558997 +SMGC_1 zika KX266255 2016-02-14 Oceania American Samoa American Samoa American Samoa genbank genome Bi et al https://www.ncbi.nlm.nih.gov/nuccore/KX266255 diff --git a/example_data/sequences.fasta b/example_data/sequences.fasta index 64facba..9203c90 100644 --- a/example_data/sequences.fasta +++ b/example_data/sequences.fasta @@ -1,4 +1,4 @@ ->PAN/CDC_259359_V1_V3/2015 +>KX156774 gaatttgaagcgaatgctaacaacagtatcaacaggttttattttggatttggaaacgag agtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgc taaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggac @@ -179,7 +179,7 @@ gaccttccccacccttcaatctggggcctgaactggagatcagctgtggatctccagaag agggactagtggttagaggagaccccccggaaaacgcaaaacagcatattgacgctggga aagaccagagactccatgagtttccaccacgctggccgccaggcacagatcgccgaatag cggcggccggtgtggggaaatccatgggtct ->COL/FLR_00024/2015 +>MF574569 tcagactgcgacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttatt ttggatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggatt ccggattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaa @@ -358,7 +358,7 @@ agctgggaaaccaagcctatagtcaggccgagaacgccatggcacggaagaagccatgct gcctgtgagcccctcagaggacactgagtcaaaaaaccccacgcgcttggaggcgcagga tgggaaaagaaggtggcgaccttccccacccttcaatctggggcctgaactggagatcag ctgtggatctccagaagagggactagtggttagaggaga ->PRVABC59 +>KU501215 gttgttgatctgtgtgaatcagactgcgacagttcgagtttgaagcgaaagctagcaaca gtatcaacaggttttattttggatttggaaacgagagtttctggtcatgaaaaacccaaa aaagaaatccggaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtgag @@ -537,7 +537,7 @@ tgtgacccccccaggagaagctgggaaaccaagcctatagtcaggccgagaacgccatgg 
cacggaagaagccatgctgcctgtgagcccctcagaggacactgagtcaaaaaaccccac gcgcttggaggcgcaggatgggaaaagaaggtggcgaccttccccacccttcaatctggg gcctgaactggagatcagctgtggatctccagaagagggactagtggttagagga ->COL/FLR_00008/2015 +>MF574562 tcagactgcgacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttatt ttggatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggatt ccggattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaa @@ -716,7 +716,7 @@ agctgggaaaccaagcctatagtcaggccgagaacgccatggcacggaagaagccatgct gcctgtgagcccctcagaggacactgagtcaaaaaaccccacgcgcttggaggcgcagga tgggaaaagaaggtggcgaccttccccacccttcaatctggggcctgaactggagatcag ctgtggatctccagaagagggactagtggttagaggaga ->Colombia/2016/ZC204Se +>KY317939 gacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttattttggatttg gaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgt caatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgcc @@ -894,7 +894,7 @@ agtcagccacagcttggggaaagctgtgcagcctgtgacccccccaggagaagctgggaa accaagcctatagtcaggccgagaacgccatggcacggaagaagccatgctgcctgtgag cccctcagaggacactgagtcaaaaaaccccacgcgcttggaggcgcaggatgggaaaag aaggtggcgaccttccccacccttcaatctggggcctgaactggagat ->ZKC2/2016 +>KX253996 agttgttgatctgtgtgaatcagactgcgacagttcgagtttgaagcgaaagctagcaac agtatcaacaggttttattttggatttggaaacgagagtttctggtcatgaaaaacccaa aaaagaaatccggaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtga @@ -1076,7 +1076,7 @@ ggcctgaactggagatcagctgtggatctccagaagagggactagtggttagaggagacc ccccggaaaacgcaaaacagcatattgacgctgggaaagaccagagactccatgagtttc caccacgctggccgccaggcacagatcgccgaatagcggcggccggtgtggggaaatcca tgggtct ->VEN/UF_1/2016 +>KX702400 agttgttactgttgctgactcagactgcgacagttcgagtttgaagcgaaagctagcaac agtatcaacaggttttattttggatttggaaacgagagtttctggtcatgaaaaacccaa aaaagaaatccggaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtga @@ -1258,7 +1258,7 @@ ggcctgaactggagatcagctgtggatctccagaagagggactagtggttagaggagacc ccccggaaaacgcaaaacagcatattgacgctgggaaagaccagagactccatgagtttc caccacgctggccgccaggcacagatcgccgaatagcggcggccggtgtggggaaatcca tgggtctt ->DOM/2016/BB_0059 +>KY785425 
tggctgccatgctgagaataatcaatgctaggaaggagaagaagagacgaggcgcagata ctagtgtcggaattgttggcctcctgctgaccacagctatggcagcggaggtcactagac gtgggagtgcatactacatgtacttggacagaaacgatgctggggaggccatatctttcc @@ -1427,7 +1427,7 @@ ggtgtggatctctcatagggcacagaccgcgcaccacctgggctgagaacattaaaaaca cagtcaacatggtgcgcaggatcataggtgaggaagaaaagtacatggactacctatcca cccaagttcgctacttgggtgaagaagggtctacacctggagtgctgtaagcaccaatct taatgttgtcaggcc ->BRA/2016/FC_6706 +>KY785433 agtttgaagcgaaagctagcaacagtatcaacaggttttatttyggatttggaaacgaga gtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgct aaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggact @@ -1601,7 +1601,7 @@ cattccctatttgggaaaaagggaagacttgtggtgtggatctctcatagggcacagacc gcgcaccacctgggctgagaacattaaaaacacagtcaacatggtgcgcaggatcatagg tgatgaagaaaagtacatggactacctatccacccaagttcgctacttgggtgaagaagg gtctacacctggagtgctgtaagcaccaatcttaatgttgtcaggc ->DOM/2016/BB_0183 +>KY785420 gtttgaagcgaaagctagcaacagtatcaacaggttttattttggatttggaaacgagag tttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgcta aaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggactt @@ -1780,7 +1780,7 @@ tagtcaggccgagaacgccatggcacggaagaagccatgctgcctgtgagcccctcagag gacactgagtcaaaaaaccccacgcgcttggaggcgcaggatgggaaaagaaggtggcga ccttccccacccttcaatctggggcctgaactggagatcagctgtggatccccagaagag g ->EcEs062_16 +>KX879603 agtagttgatctgtgtgaatcagactgcgacagttcgagtttgaagcgaaagctagcaac agtatcaacaggttttattttggatttggaaacgagagtttctggtcatgaaaaacccaa aaaagaaatccggaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtga @@ -1962,7 +1962,7 @@ ggcctgaactggagatcagctgtggatctccagaagagggactagtggttagaggagacc ccccggaaaacgcaaaacagcatattgacgctgggaaagaccagagactccatgagtttc caccacgctggccgccaggcacagatcgccgaatagcggcggccggtgtggggaaatcca tgggagatcgga ->HND/2016/HU_ME59 +>KY785418 gtttgaagcgaaagctagcaacagtatcaacaggttttattttggatttggaaacgagag tttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgcta aaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggactt @@ -2136,7 +2136,7 @@ 
attccctatttgggaaaaagggaagacttgtggtgtggatctctcatagggcacagaccg cgcaccacctgggctgagaacattaaaaacacagtcaacatggtgcgcaggatcataggt gatgaagaaaagtacatggactacctatccacccaagttcgctacttgggtgaagaaggg tctacacctggagtgctgtaagcaccaatcttaatgttgtcaggc ->DOM/2016/MA_WGS16_011 +>KY785484 aagcgaaagctagcaacagtatcaacaggttttattttggatttggaaacgagagtttct ggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgctaaaacg cggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggacttctgct @@ -2314,7 +2314,7 @@ ggggaaagctgtgcagcctgtgacccccccaggagaagctgggaaaccaagcctatagtc aggccgagaacgccatggcacggaagaagccatgctgcctgtgagcccctcagaggacac tgagtcaaaaaaccccacgcgcttggaggcgcaggatgggaaaagaaggtggcgaccttc cccacccttcaatctggggcctgaactggggatcag ->DOM/2016/BB_0433 +>KY785441 tttgaagcgaaagctagcaacagtatcaacaggttttattttggatttggaaacgagagt ttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgctaa aacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggacttc @@ -2488,7 +2488,7 @@ ttccctatttgggaaaaagggaagacttgtggtgtggatctctcatagggcacagaccgc gcaccacctgggctgagaacattaaaaacacagtcaacatggtgcgcaggatcataggtg aggaagaaaagtacatggactacctatccacccaagttcgctacttgggtgaagaagggt ctacacctggagtgctgtaagcaccaatcctaatgttgtcaggcc ->USA/2016/FL022 +>KY075935 gcgacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttattttggatt tggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggatt gtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctg @@ -2662,7 +2662,7 @@ aaatggacagacattccctatttgggaaaaagggaagacttgtggtgtggatctctcata gggcacagaccgcgcaccacctgggctgagaacattaaaaacacagtcaacatggtgcgc aggatcataggtgaggaagaaaagtacatggactacctatccacccaagtccgctacttg ggtgaagaagggtctacacctggagtgctgtaagcaccaatctta ->SG_027 +>KY241697 ctgcgacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttattttgga tttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccgga ttgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggc @@ -2842,7 +2842,7 @@ tgagcccctcagaggacactgagtcaaaaaaccccacgcgcttggaggcgcaggatggga aaagaaggtggcgaccttccccacccttcaatctggggcctgaactggagatcagctgtg 
gatctccagaagagggactagtggttagaggagaccccccggaaaacgcaaaacagcata ttgacgctgggaaagaccagagactccatgagtttccaccacgctggccgccag ->SG_074 +>KY241744 gaatcagactgcgacagttcgagtttgaagcgaaagctagcaacagtatcaacaggtttt attttggatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggagg attccggattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggctt @@ -3023,7 +3023,7 @@ ggatgggaaaagaaggtggcgaccttccccacccttcaatctggggcctgaactggagat cagctgtggatctccagaagagggactagtggttagaggagaccccccggaaaacgcaaa acagcatattgacgctgggaaagaccagagactccatgagtttccaccacgctggccgcc aggcacagatcgccgaatagcg ->SG_056 +>KY241726 gaatcagactgcgacagttcgagtttgaagcgaaagctagcaacagtatcaacaggtttt attttggatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggagg attccggattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggctt @@ -3203,7 +3203,7 @@ gctgcctgtgagcccctcagaggacactgagtcaaaaaaccccacgcgcttggaggcgca ggatgggaaaagaaggtggcgaccttccccacccttcaatctggggcctgaactggagat cagctgtggatctccagaagagggactagtggttagaggagaccccccggaaaacgcaaa acagcatattgacgctgggaaagaccagagactccatgagtttccaccacgctggcc ->USA/2016/FLUR022 +>KY325473 gtgtgaatcagactgcgacagttcgagtttgaagcgaaagctagcaacagtatcaacagg ttttattttggatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccg gaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggg @@ -3384,7 +3384,7 @@ cgcaggatgggaaaagaaggtggcgaccttccccacccttcaatctggggcctgaactgg agatcagctgtggatctccagaagagggactagtggttagaggagaccccccggaaaacg caaaacagcatattgacgctgggaaagaccagagactccatgagtttccaccacgctggc cgccaggcacagatcgccgaatagcggcggccggtgtggggaaatc ->Aedes_aegypti/USA/2016/FL05 +>KY075937 gacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttattttggatttg gaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgt caatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgcc @@ -3562,7 +3562,7 @@ agtcagccacagcttggggaaagctgtgcagcctgtgacccccccaggagaagctgggaa accaagcctatagtcaggccgagaacgccatggcacggaagaagccatgctgcctgtgag cccctcagaggacactgagtcaaaaaaccccacgcgcttggaggcgcaggatgggaaaag aaggtggcgaccttccccacccttcaatctggggcctgaactggagat ->SG_018 +>KY241688 
atgnnnnnnnnnnnnnnnnnnnccggaggattccggattgtcaatatgctaaaacgcgga gtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggacttctgctgggt catgggcccatcaggatggtcttggcgattctagcctttttgaggttcacggcaatcaag @@ -3741,7 +3741,7 @@ tcaaaaaaccccacgcgcttggaggcgcaggatgggaaaagaaggtggcgaccttcccca cccttcaatctggggcctgaactggagatcagctgtggatctccagaagagggactagtg gttagaggagaccccccggaaaacgcaaaacagcatattgacgctgggaaagaccagaga ctccatgagtttccaccacgctggccgccaggcacagat ->USA/2016/FLWB042 +>KY325478 ctttgggggcttgaagaggctgccagccggacttctgctgggtcatgggcccatcaggat ggtcttggcgattctagcctttttgagattcacggcaatcaagccatcactgggtctcat caatagatggggttcagtggggaaaaaagaggctatggaaataataaagaagttcaagaa @@ -3916,7 +3916,7 @@ aatctcaatgttgtcaggcctgctagtcagccacagcttggggaaagctgtgcagcctgt gacccccccaggagaagctgggaaaccaagcctatagtcaggccgagaacgccatggcac ggaagaagccatgctgcctgtgagcccctcagaggacactgagtcaaaaaaccccacgcg cttggaggcgcaggnnnnnnaaagaag ->COL/PRV_00028/2015 +>MF574578 ttgaagcgaaagctagcaacagtatcaacaggttttattttggatttggaaacgagagtt tctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgctaaa acgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggacttct @@ -4095,7 +4095,7 @@ gtcaggccgagaacgccatggcacggaagaagccatgctgcctgtgagcccctcagagga cactgagtcaaaaaaccccacgcgcttggaggcgcaggatgggaaaagaaggtggcgacc ttccccacccttcaatctggggcctgaactggagatcagctgtggatctccagaagaggg actagtggttagaggaga ->Thailand/1610acTw +>MF692778 gcaacagtatcaacaggttttattttggatttggaaacgagagtttctggtcatgaaaaa cccaaaaaagaaatccggaggattccggattgtcaatatgctaaaacgcggagtagcccg tgtgagcccctttgggggcttgaagaggctgccagccggacttctgctgggccatgggcc @@ -4271,7 +4271,7 @@ ggactacctatccacccaagttcgctacttgggtgaagaagggtctacacctggagtgct gtaagcaccaatcttagtgttgtcaggcctgctagtcagccacagcttggggaaagctgt gcagcctgtgacccccccaggagaagctgggaaaccaagcccatagtcaggccgagaacg ccatggcacggaag ->1_0087_PF +>KX447509 agtatcaacaggttttattttggatttggaaacgagagtttctggtcatgaaaaacccaa aaaagaaatccggaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtga gcccctttgggggcttgaagaggctgccagccggacttctgctgggtcatgggcccatca @@ -4449,7 +4449,7 @@ 
ctgtgacccccccaggagaagctgggaaaccaagcctatagtcaggccgagaacgccatg gcacggaagaagccatgctgcctgtgagcccctcagaggacactgagtcaaaaaacccca cgcgcttggaggcgcaggatgggaaaagaaggtggcgaccttccccacccttcaatctgg ggcctgaactggagatcagctgtggat ->1_0199_PF +>KX447519 actgcgacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttattttgg atttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccgg attgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagagg @@ -4603,7 +4603,7 @@ tgggctctagtggacaaggaaagagagcaccacctgagaggagagtgccagagttgtgtg tacaacatgatgggaaaaagagaaaagaaacaaggggaatttggaaaggccaagggcagc cgcgccatctggtatatgtggctaggggctagatttctagagttcgaagcccttggattc ttgaacgaggatcactggatgg ->1_0181_PF +>KX447512 agtatcaacaggttttattttggatttggaaacgagagtttctggtcatgaaaaacccaa aaaagaaatccggaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtga gcccctttgggggcttgaagaggctgccagccggacttctgctgggtcatgggcccatca @@ -4781,7 +4781,7 @@ ctgtgacccccccaggagaagctgggaaaccaagcctatagtcaggccgagaacgccatg gcacggaagaagccatgctgcctgtgagcccctcagaggacactgagtcaaaaaacccca cgcgcttggaggcgcaggatgggaaaagaaggtggcgaccttccccacccttcaatctgg ggcctgaactggagatcagctgtgga ->Brazil/2015/ZBRC301 +>KY558995 gatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccg gattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagag gctgccagccggacttctgctgggtcatgggcccatcaggatggtcttggcgattctagc @@ -4950,7 +4950,7 @@ gactgcttgcctagcaaaatcatatgcgcagatgtggcagctcctttatttccacagaan ggacctccgactgatggccaatgccatttgttcatctgtgccagttgactgggttccaac tgggagaactacctggtcaatccatggaaanggagaatggatgaccactgaagacatgct tg ->Brazil/2015/ZBRA105 +>KY558989 gatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccg gattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagag gctgccagccggacttctgctgggtcatgggcccatcaggatggtcttggcgattctagc @@ -5119,7 +5119,7 @@ gactgcttgcctagcaaaatcatatgcgcaaatgtggcagctcctttatttccacagaag ggacctccgactgatggccaatgccatttgttcatctgtgccagttgactgggttccaac tgggagaactacctggtcaatccatggaaagggagaatggatgaccactgaagacatgct tg ->Brazil/2016/ZBRC16 +>KY558991 
tgagaataatcaatgctaggaaggagaagaagagacgaggcgcagatactagtgtcggaa ttgttggcctcctgctgaccacagctatggcagcggaggtcactagacgtgggagtgcat actatatgtacttggacagaaacgatgctggggaggccatatcttttccaaccacattgg @@ -5272,7 +5272,7 @@ atgcagatgacactgctggctgggacacccgcatcagcaggtttgatctggagaatgaag ctctaatcaccaaccaaatggagagagggcacagggccttggcattggccataatcaagt acacataccaaaacaaagtggtaaaggtccttagaccagctgaaaaagggaaaacagtta tggacatcatttcgagacaagaccaaaggggg ->V8375 +>KU501217 atgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgctaaaacgcgga gtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggacttctgctgggt catgggcccatcaggatggtcttggcgattctagcctttttgagattcacggcaatcaag @@ -5445,7 +5445,7 @@ ttgggaaaaagggaagacttgtggtgtggatctctcatagggcacagaccgcgcaccacc tgggctgagaacattaaaaacacagtcaacatggtgcgcaggatcataggtgatgaagaa aagtacatggactacctatccacccaagttcgctacttgggtgaagaagggtctacacct ggagtgctgtaa ->Nica1_16 +>KX421195 tcgagtttgaagcgaaagctagcaacagtatcaacaggttttattttggatttggaaacg agagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatat gctaaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccgg @@ -5624,7 +5624,7 @@ cctatagtcaggccgagaacgccatggcacggaagaagccatgctgcctgtgagcccctc agaggacactgagtcaaaaaaccccacgcgcttggaggcgcaggatgggaaaagaaggtg gcgaccttccccacccttcaatctggggcctgaactggagatcagctgtggatctccaga agagggactagtggttagaggag ->Brazil/2015/ZBRC303 +>KY558997 tgagaataatcaatgctaggaaggagaagaagagacgaggcacagatactagtgtcggaa ttgttggcctcctgctgaccacagctatggcagcggaggtcactagacgtgggagtgcat actatatgtacttggacagaaacgatgctggggaggccatatcttttccaaccacattgg @@ -5782,7 +5782,7 @@ agatgcaagacttgtggctgctgcggaggtcagagaaagtgaccaactggttgcagagca acggatgggataggctcaaacgaatggcagtcagtggagatgattgcgttgtgaagccaa ttgatgataggtttgcacatgccctcaggttcttgaatgatatgggaaaagttaggaagg acacacaagagtgg ->SMGC_1 +>KX266255 tctgtgtgaatcagactgcgacagttcgagtttgaagcgaaagctagcaacagtatcaac aggttttattttggatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaat ccggaggattccggattgtcaatatgctaaaacgcggagtagcccgtgtgagcccctttg diff --git a/scripts/set_final_strain_name.py b/scripts/set_final_strain_name.py new 
file mode 100644 index 0000000..08ca935 --- /dev/null +++ b/scripts/set_final_strain_name.py @@ -0,0 +1,38 @@ +import pandas as pd +import json, argparse +from augur.io import read_metadata + +def replace_name_recursive(node, lookup): + if node["name"] in lookup: + node["name"] = lookup[node["name"]] + + if "children" in node: + for child in node["children"]: + replace_name_recursive(child, lookup) + +if __name__=="__main__": + parser = argparse.ArgumentParser( + description="Swaps out the strain names in the Auspice JSON with the final strain name", + formatter_class=argparse.ArgumentDefaultsHelpFormatter + ) + + parser.add_argument('--input-auspice-json', type=str, required=True, help="input auspice_json") + parser.add_argument('--metadata', type=str, required=True, help="input data") + parser.add_argument('--metadata-id-columns', nargs="+", help="names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.") + parser.add_argument('--display-strain-name', type=str, required=True, help="field to use as strain name in auspice") + parser.add_argument('--output', type=str, metavar="JSON", required=True, help="output Auspice JSON") + args = parser.parse_args() + + metadata = read_metadata(args.metadata, id_columns=args.metadata_id_columns) + name_lookup = {} + for ri, row in metadata.iterrows(): + strain_id = row.name + name_lookup[strain_id] = args.display_strain_name if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name] + + with open(args.input_auspice_json, 'r') as fh: + data = json.load(fh) + + replace_name_recursive(data['tree'], name_lookup) + + with open(args.output, 'w') as fh: + json.dump(data, fh) From 7b37112f4c100ce258c4d20d517c9daa2f3d3f2e Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Wed, 15 Nov 2023 12:25:03 -0800 Subject: [PATCH 10/28] workaround to get accession links to work --- scripts/set_final_strain_name.py | 34 ++++++++++++++++++++++---------- 1 file 
changed, 24 insertions(+), 10 deletions(-) diff --git a/scripts/set_final_strain_name.py b/scripts/set_final_strain_name.py index 08ca935..c670f44 100644 --- a/scripts/set_final_strain_name.py +++ b/scripts/set_final_strain_name.py @@ -2,13 +2,22 @@ import json, argparse from augur.io import read_metadata -def replace_name_recursive(node, lookup): +def replace_name_recursive(node, lookup, saveoldcolumn): if node["name"] in lookup: + if saveoldcolumn == "accession": + node["node_attrs"][saveoldcolumn] = node["name"] + node["node_attrs"]["url"] = "https://www.ncbi.nlm.nih.gov/nuccore/" + node["name"] + elif saveoldcolumn == "genbank_accession": + node["node_attrs"][saveoldcolumn] = {} + node["node_attrs"][saveoldcolumn]["value"] = node["name"] + else: + node["node_attrs"][saveoldcolumn] = node["name"] + node["name"] = lookup[node["name"]] if "children" in node: for child in node["children"]: - replace_name_recursive(child, lookup) + replace_name_recursive(child, lookup, saveoldcolumn) if __name__=="__main__": parser = argparse.ArgumentParser( @@ -24,15 +33,20 @@ def replace_name_recursive(node, lookup): args = parser.parse_args() metadata = read_metadata(args.metadata, id_columns=args.metadata_id_columns) - name_lookup = {} - for ri, row in metadata.iterrows(): - strain_id = row.name - name_lookup[strain_id] = args.display_strain_name if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name] - with open(args.input_auspice_json, 'r') as fh: - data = json.load(fh) + if args.display_strain_name in metadata.columns: + name_lookup = {} + for ri, row in metadata.iterrows(): + strain_id = row.name + name_lookup[strain_id] = args.display_strain_name if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name] + + with open(args.input_auspice_json, 'r') as fh: + data = json.load(fh) - replace_name_recursive(data['tree'], name_lookup) + replace_name_recursive(data['tree'], name_lookup, args.metadata_id_columns[0]) + else: + with 
open(args.input_auspice_json, 'r') as fh: + data = json.load(fh) with open(args.output, 'w') as fh: - json.dump(data, fh) + json.dump(data, fh, allow_nan=False, indent=None, separators=",:") From 9071f0259f6e5d7652cde091382158c96a2fdc92 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Fri, 17 Nov 2023 10:18:45 -0800 Subject: [PATCH 11/28] Move phylogenetic workflow to a phylogenetic folder Move phylogenetic workflow from top-level to folder phylogenetic in order to follow the Pathogen Repo Template: https://github.com/nextstrain/pathogen-repo-template --- .github/workflows/ci.yaml | 4 +- README.md | 90 ++----------------- phylogenetic/README.md | 88 ++++++++++++++++++ Snakefile => phylogenetic/Snakefile | 0 .../config}/auspice_config.json | 0 {config => phylogenetic/config}/colors.tsv | 0 .../config}/config_zika.yaml | 0 .../config}/description.md | 0 .../config}/dropped_strains.txt | 0 .../config}/zika_reference.gb | 0 .../example_data}/metadata.tsv | 0 .../example_data}/sequences.fasta | 0 .../scripts}/check-countries-have-colors.sh | 0 .../scripts}/set_final_strain_name.py | 0 14 files changed, 98 insertions(+), 84 deletions(-) create mode 100644 phylogenetic/README.md rename Snakefile => phylogenetic/Snakefile (100%) rename {config => phylogenetic/config}/auspice_config.json (100%) rename {config => phylogenetic/config}/colors.tsv (100%) rename {config => phylogenetic/config}/config_zika.yaml (100%) rename {config => phylogenetic/config}/description.md (100%) rename {config => phylogenetic/config}/dropped_strains.txt (100%) rename {config => phylogenetic/config}/zika_reference.gb (100%) rename {example_data => phylogenetic/example_data}/metadata.tsv (100%) rename {example_data => phylogenetic/example_data}/sequences.fasta (100%) rename {scripts => phylogenetic/scripts}/check-countries-have-colors.sh (100%) rename {scripts => phylogenetic/scripts}/set_final_strain_name.py (100%) diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml index 
b1f5bca..6fa98cd 100644 --- a/.github/workflows/ci.yaml +++ b/.github/workflows/ci.yaml @@ -6,4 +6,6 @@ on: jobs: ci: - uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@master + uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@dec0880059017dac7facf100435c5737bf1386c8 + with: + workflow-root: phylogenetic diff --git a/README.md b/README.md index 568bb03..ee1a6e6 100644 --- a/README.md +++ b/README.md @@ -1,88 +1,12 @@ -# nextstrain.org/zika +# Nextstrain repository for Zika virus -This is the [Nextstrain](https://nextstrain.org) build for Zika, visible at -[nextstrain.org/zika](https://nextstrain.org/zika). +This repository contains two workflows for the analysis of Zika virus data: -The build encompasses fetching data, preparing it for analysis, doing quality -control, performing analyses, and saving the results in a format suitable for -visualization (with [auspice][]). This involves running components of -Nextstrain such as [fauna][] and [augur][]. +- [`ingest/`](./ingest) - Download data from GenBank, clean and curate it and upload it to S3 +- [`phylogenetic/`](./phylogenetic) - Make phylogenetic trees for nextstrain.org -All Zika-specific steps and functionality for the Nextstrain pipeline should be -housed in this repository. +Each folder contains a README.md with more information. -_This build requires Augur v6._ +## Documentation -[![Build Status](https://github.com/nextstrain/zika/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/nextstrain/zika/actions/workflows/ci.yaml) - -## Usage - -If you're unfamiliar with Nextstrain builds, you may want to follow our -[quickstart guide][] first and then come back here. - -There are two main ways to run & visualise the output from this build: - -The first, and easiest, way to run this pathogen build is using the [Nextstrain -command-line tool][nextstrain-cli]: -``` -nextstrain build . 
-nextstrain view auspice/ -``` - -See the [nextstrain-cli README][] for how to install the `nextstrain` command. - -The second is to install augur & auspice using conda, following [these instructions](https://nextstrain.org/docs/getting-started/local-installation#install-augur--auspice-with-conda-recommended). -The build may then be run via: -``` -snakemake -auspice --datasetDir auspice/ -``` - -Build output goes into the directories `data/`, `results/` and `auspice/`. - -## Configuration - -Configuration takes place entirely with the `Snakefile`. This can be read top-to-bottom, each rule -specifies its file inputs and output and also its parameters. There is little redirection and each -rule should be able to be reasoned with on its own. - - -## Input data - -This build starts by downloading sequences from -https://data.nextstrain.org/files/zika/sequences.fasta.xz -and metadata from -https://data.nextstrain.org/files/zika/metadata.tsv.gz. -These are publicly provisioned data by the Nextstrain team by pulling sequences -from NCBI GenBank via ViPR and performing -[additional bespoke curation](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md). - -Data from GenBank follows Open Data principles, such that we can make input data -and intermediate files available for further analysis. Open Data is data that -can be freely used, re-used and redistributed by anyone - subject only, at most, -to the requirement to attribute and sharealike. - -We gratefully acknowledge the authors, originating and submitting laboratories -of the genetic sequences and metadata for sharing their work in open databases. -Please note that although data generators have generously shared data in an open -fashion, that does not mean there should be free license to publish on this -data. Data generators should be cited where possible and collaborations should -be sought in some circumstances. Please try to avoid scooping someone else's -work. Reach out if uncertain. 
Authors, paper references (where available) and -links to GenBank entries are provided in the metadata file. - -A faster build process can be run working from example data by copying over -sequences and metadata from `example_data/` to `data/` via: -``` -mkdir -p data/ -cp -v example_data/* data/ -``` - -[Nextstrain]: https://nextstrain.org -[fauna]: https://github.com/nextstrain/fauna -[augur]: https://github.com/nextstrain/augur -[auspice]: https://github.com/nextstrain/auspice -[snakemake cli]: https://snakemake.readthedocs.io/en/stable/executable.html#all-options -[nextstrain-cli]: https://github.com/nextstrain/cli -[nextstrain-cli README]: https://github.com/nextstrain/cli/blob/master/README.md -[quickstart guide]: https://nextstrain.org/docs/getting-started/quickstart +- [Contributor documentation](./CONTRIBUTING.md) diff --git a/phylogenetic/README.md b/phylogenetic/README.md new file mode 100644 index 0000000..568bb03 --- /dev/null +++ b/phylogenetic/README.md @@ -0,0 +1,88 @@ +# nextstrain.org/zika + +This is the [Nextstrain](https://nextstrain.org) build for Zika, visible at +[nextstrain.org/zika](https://nextstrain.org/zika). + +The build encompasses fetching data, preparing it for analysis, doing quality +control, performing analyses, and saving the results in a format suitable for +visualization (with [auspice][]). This involves running components of +Nextstrain such as [fauna][] and [augur][]. + +All Zika-specific steps and functionality for the Nextstrain pipeline should be +housed in this repository. + +_This build requires Augur v6._ + +[![Build Status](https://github.com/nextstrain/zika/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/nextstrain/zika/actions/workflows/ci.yaml) + +## Usage + +If you're unfamiliar with Nextstrain builds, you may want to follow our +[quickstart guide][] first and then come back here. 
+ +There are two main ways to run & visualise the output from this build: + +The first, and easiest, way to run this pathogen build is using the [Nextstrain +command-line tool][nextstrain-cli]: +``` +nextstrain build . +nextstrain view auspice/ +``` + +See the [nextstrain-cli README][] for how to install the `nextstrain` command. + +The second is to install augur & auspice using conda, following [these instructions](https://nextstrain.org/docs/getting-started/local-installation#install-augur--auspice-with-conda-recommended). +The build may then be run via: +``` +snakemake +auspice --datasetDir auspice/ +``` + +Build output goes into the directories `data/`, `results/` and `auspice/`. + +## Configuration + +Configuration takes place entirely with the `Snakefile`. This can be read top-to-bottom, each rule +specifies its file inputs and output and also its parameters. There is little redirection and each +rule should be able to be reasoned with on its own. + + +## Input data + +This build starts by downloading sequences from +https://data.nextstrain.org/files/zika/sequences.fasta.xz +and metadata from +https://data.nextstrain.org/files/zika/metadata.tsv.gz. +These are publicly provisioned data by the Nextstrain team by pulling sequences +from NCBI GenBank via ViPR and performing +[additional bespoke curation](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md). + +Data from GenBank follows Open Data principles, such that we can make input data +and intermediate files available for further analysis. Open Data is data that +can be freely used, re-used and redistributed by anyone - subject only, at most, +to the requirement to attribute and sharealike. + +We gratefully acknowledge the authors, originating and submitting laboratories +of the genetic sequences and metadata for sharing their work in open databases. 
+Please note that although data generators have generously shared data in an open +fashion, that does not mean there should be free license to publish on this +data. Data generators should be cited where possible and collaborations should +be sought in some circumstances. Please try to avoid scooping someone else's +work. Reach out if uncertain. Authors, paper references (where available) and +links to GenBank entries are provided in the metadata file. + +A faster build process can be run working from example data by copying over +sequences and metadata from `example_data/` to `data/` via: +``` +mkdir -p data/ +cp -v example_data/* data/ +``` + +[Nextstrain]: https://nextstrain.org +[fauna]: https://github.com/nextstrain/fauna +[augur]: https://github.com/nextstrain/augur +[auspice]: https://github.com/nextstrain/auspice +[snakemake cli]: https://snakemake.readthedocs.io/en/stable/executable.html#all-options +[nextstrain-cli]: https://github.com/nextstrain/cli +[nextstrain-cli README]: https://github.com/nextstrain/cli/blob/master/README.md +[quickstart guide]: https://nextstrain.org/docs/getting-started/quickstart diff --git a/Snakefile b/phylogenetic/Snakefile similarity index 100% rename from Snakefile rename to phylogenetic/Snakefile diff --git a/config/auspice_config.json b/phylogenetic/config/auspice_config.json similarity index 100% rename from config/auspice_config.json rename to phylogenetic/config/auspice_config.json diff --git a/config/colors.tsv b/phylogenetic/config/colors.tsv similarity index 100% rename from config/colors.tsv rename to phylogenetic/config/colors.tsv diff --git a/config/config_zika.yaml b/phylogenetic/config/config_zika.yaml similarity index 100% rename from config/config_zika.yaml rename to phylogenetic/config/config_zika.yaml diff --git a/config/description.md b/phylogenetic/config/description.md similarity index 100% rename from config/description.md rename to phylogenetic/config/description.md diff --git 
a/config/dropped_strains.txt b/phylogenetic/config/dropped_strains.txt similarity index 100% rename from config/dropped_strains.txt rename to phylogenetic/config/dropped_strains.txt diff --git a/config/zika_reference.gb b/phylogenetic/config/zika_reference.gb similarity index 100% rename from config/zika_reference.gb rename to phylogenetic/config/zika_reference.gb diff --git a/example_data/metadata.tsv b/phylogenetic/example_data/metadata.tsv similarity index 100% rename from example_data/metadata.tsv rename to phylogenetic/example_data/metadata.tsv diff --git a/example_data/sequences.fasta b/phylogenetic/example_data/sequences.fasta similarity index 100% rename from example_data/sequences.fasta rename to phylogenetic/example_data/sequences.fasta diff --git a/scripts/check-countries-have-colors.sh b/phylogenetic/scripts/check-countries-have-colors.sh similarity index 100% rename from scripts/check-countries-have-colors.sh rename to phylogenetic/scripts/check-countries-have-colors.sh diff --git a/scripts/set_final_strain_name.py b/phylogenetic/scripts/set_final_strain_name.py similarity index 100% rename from scripts/set_final_strain_name.py rename to phylogenetic/scripts/set_final_strain_name.py From acd7605827213284dc319205ac2c0c988b16ebf1 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Tue, 21 Nov 2023 13:32:25 -0800 Subject: [PATCH 12/28] Add rules for merging USVI data with NCBI GenBank ingested data. The original Zika build contained USVI data that had been posted publicly to GitHub but not yet submitted to NCBI GenBank. This commit adds rules to merge the USVI data with the NCBI GenBank data. Since USVI does not have a genbank_accession column, we create a new accession column for both USVI and NCBI GenBank accessions. This accession column is then used as the strain_id column for the phylogenetic build.
Since auspice automagically generates an NCBI GenBank URL for "genbank_accession" fields, we use a "url" field instead, allowing a mix of GenBank and GitHub URLs to be used in the strain popup window. --- phylogenetic/Snakefile | 14 +- phylogenetic/config/config_zika.yaml | 2 +- phylogenetic/example_data/metadata.tsv | 70 ++++----- phylogenetic/example_data/metadata_usvi.tsv | 2 + .../example_data/sequences_usvi.fasta | 137 ++++++++++++++++++ phylogenetic/rules/usvi.smk | 52 +++++++ phylogenetic/scripts/set_final_strain_name.py | 1 - 7 files changed, 235 insertions(+), 43 deletions(-) create mode 100644 phylogenetic/example_data/metadata_usvi.tsv create mode 100644 phylogenetic/example_data/sequences_usvi.fasta create mode 100644 phylogenetic/rules/usvi.smk diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index 15ff6a9..ca29a68 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -16,6 +16,8 @@ rule files: files = rules.files.params +include: "rules/usvi.smk" + rule download: """Downloading sequences and metadata from data.nextstrain.org""" output: @@ -53,8 +55,8 @@ rule filter: - minimum genome length of {params.min_length} (50% of Zika virus genome) """ input: - sequences = "data/sequences.fasta", - metadata = "data/metadata.tsv", + sequences = "data/sequences_all.fasta", + metadata = "data/metadata_all.tsv", exclude = files.dropped_strains output: sequences = "results/filtered.fasta" @@ -122,7 +124,7 @@ rule refine: input: tree = "results/tree_raw.nwk", alignment = "results/aligned.fasta", - metadata = "data/metadata.tsv" + metadata = "data/metadata_all.tsv" output: tree = "results/tree.nwk", node_data = "results/branch_lengths.json" @@ -189,7 +191,7 @@ rule traits: """ input: tree = "results/tree.nwk", - metadata = "data/metadata.tsv" + metadata = "data/metadata_all.tsv" output: node_data = "results/traits.json", params: @@ -212,7 +214,7 @@ rule export: """Exporting data files for for auspice""" input: tree = "results/tree.nwk",
metadata = "data/metadata.tsv", + metadata = "data/metadata_all.tsv", branch_lengths = "results/branch_lengths.json", traits = "results/traits.json", nt_muts = "results/nt_muts.json", @@ -242,7 +244,7 @@ rule export: rule final_strain_name: input: auspice_json="results/raw_zika.json", - metadata="data/metadata.tsv", + metadata="data/metadata_all.tsv", root_sequence="results/raw_zika_root-sequence.json", output: auspice_json="auspice/zika.json", diff --git a/phylogenetic/config/config_zika.yaml b/phylogenetic/config/config_zika.yaml index 5345584..fa4e134 100644 --- a/phylogenetic/config/config_zika.yaml +++ b/phylogenetic/config/config_zika.yaml @@ -1,2 +1,2 @@ -strain_id_field: "genbank_accession" +strain_id_field: "accession" display_strain_field: "strain" \ No newline at end of file diff --git a/phylogenetic/example_data/metadata.tsv b/phylogenetic/example_data/metadata.tsv index 6e5345c..3d39cf9 100644 --- a/phylogenetic/example_data/metadata.tsv +++ b/phylogenetic/example_data/metadata.tsv @@ -1,35 +1,35 @@ -strain virus genbank_accession date region country division city db segment authors url -PAN/CDC_259359_V1_V3/2015 zika KX156774 2015-12-18 North America Panama Panama Panama genbank genome Shabman et al https://www.ncbi.nlm.nih.gov/nuccore/KX156774 -COL/FLR_00024/2015 zika MF574569 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574569 -PRVABC59 zika KU501215 2015-12-XX North America Puerto Rico Puerto Rico Puerto Rico genbank genome Lanciotti et al https://www.ncbi.nlm.nih.gov/nuccore/KU501215 -COL/FLR_00008/2015 zika MF574562 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574562 -Colombia/2016/ZC204Se zika KY317939 2016-01-06 South America Colombia Colombia Colombia genbank genome Quick et al https://www.ncbi.nlm.nih.gov/nuccore/KY317939 -ZKC2/2016 zika KX253996 2016-02-16 Oceania American Samoa American Samoa 
American Samoa genbank genome Wu et al https://www.ncbi.nlm.nih.gov/nuccore/KX253996 -VEN/UF_1/2016 zika KX702400 2016-03-25 South America Venezuela Venezuela Venezuela genbank genome Blohm et al https://www.ncbi.nlm.nih.gov/nuccore/KX702400 -DOM/2016/BB_0059 zika KY785425 2016-04-04 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785425 -BRA/2016/FC_6706 zika KY785433 2016-04-08 South America Brazil Brazil Brazil genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785433 -DOM/2016/BB_0183 zika KY785420 2016-04-18 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785420 -EcEs062_16 zika KX879603 2016-04-XX South America Ecuador Ecuador Ecuador genbank genome Marquez et al https://www.ncbi.nlm.nih.gov/nuccore/KX879603 -HND/2016/HU_ME59 zika KY785418 2016-05-13 North America Honduras Honduras Honduras genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785418 -DOM/2016/MA_WGS16_011 zika KY785484 2016-06-06 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785484 -DOM/2016/BB_0433 zika KY785441 2016-06-13 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al https://www.ncbi.nlm.nih.gov/nuccore/KY785441 -USA/2016/FL022 zika KY075935 2016-07-22 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY075935 -SG_027 zika KY241697 2016-08-27 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241697 -SG_074 zika KY241744 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241744 -SG_056 zika KY241726 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al 
https://www.ncbi.nlm.nih.gov/nuccore/KY241726 -USA/2016/FLUR022 zika KY325473 2016-08-31 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY325473 -Aedes_aegypti/USA/2016/FL05 zika KY075937 2016-09-09 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY075937 -SG_018 zika KY241688 2016-09-13 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al https://www.ncbi.nlm.nih.gov/nuccore/KY241688 -USA/2016/FLWB042 zika KY325478 2016-09-26 North America Usa Usa Usa genbank genome Grubaugh et al https://www.ncbi.nlm.nih.gov/nuccore/KY325478 -COL/PRV_00028/2015 zika MF574578 2016-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al https://www.ncbi.nlm.nih.gov/nuccore/MF574578 -Thailand/1610acTw zika MF692778 2016-10-XX Southeast Asia Thailand Thailand Thailand genbank genome Lin et al https://www.ncbi.nlm.nih.gov/nuccore/MF692778 -1_0087_PF zika KX447509 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447509 -1_0199_PF zika KX447519 2013-11-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447519 -1_0181_PF zika KX447512 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al https://www.ncbi.nlm.nih.gov/nuccore/KX447512 -Brazil/2015/ZBRC301 zika KY558995 2015-05-13 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558995 -Brazil/2015/ZBRA105 zika KY558989 2015-02-23 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558989 -Brazil/2016/ZBRC16 zika KY558991 2016-01-19 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558991 -V8375 zika KU501217 2015-11-01 North America Guatemala 
Guatemala Guatemala genbank genome Lanciotti et al https://www.ncbi.nlm.nih.gov/nuccore/KU501217 -Nica1_16 zika KX421195 2016-01-19 North America Nicaragua Nicaragua Nicaragua genbank genome Tabata et al https://www.ncbi.nlm.nih.gov/nuccore/KX421195 -Brazil/2015/ZBRC303 zika KY558997 2015-05-14 South America Brazil Brazil Brazil genbank genome Faria et al https://www.ncbi.nlm.nih.gov/nuccore/KY558997 -SMGC_1 zika KX266255 2016-02-14 Oceania American Samoa American Samoa American Samoa genbank genome Bi et al https://www.ncbi.nlm.nih.gov/nuccore/KX266255 +strain virus genbank_accession date region country division city db segment authors +PAN/CDC_259359_V1_V3/2015 zika KX156774 2015-12-18 North America Panama Panama Panama genbank genome Shabman et al +COL/FLR_00024/2015 zika MF574569 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al +PRVABC59 zika KU501215 2015-12-XX North America Puerto Rico Puerto Rico Puerto Rico genbank genome Lanciotti et al +COL/FLR_00008/2015 zika MF574562 2015-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al +Colombia/2016/ZC204Se zika KY317939 2016-01-06 South America Colombia Colombia Colombia genbank genome Quick et al +ZKC2/2016 zika KX253996 2016-02-16 Oceania American Samoa American Samoa American Samoa genbank genome Wu et al +VEN/UF_1/2016 zika KX702400 2016-03-25 South America Venezuela Venezuela Venezuela genbank genome Blohm et al +DOM/2016/BB_0059 zika KY785425 2016-04-04 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al +BRA/2016/FC_6706 zika KY785433 2016-04-08 South America Brazil Brazil Brazil genbank genome Metsky et al +DOM/2016/BB_0183 zika KY785420 2016-04-18 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al +EcEs062_16 zika KX879603 2016-04-XX South America Ecuador Ecuador Ecuador genbank genome Marquez et al +HND/2016/HU_ME59 zika KY785418 2016-05-13 North America 
Honduras Honduras Honduras genbank genome Metsky et al +DOM/2016/MA_WGS16_011 zika KY785484 2016-06-06 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al +DOM/2016/BB_0433 zika KY785441 2016-06-13 North America Dominican Republic Dominican Republic Dominican Republic genbank genome Metsky et al +USA/2016/FL022 zika KY075935 2016-07-22 North America Usa Usa Usa genbank genome Grubaugh et al +SG_027 zika KY241697 2016-08-27 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al +SG_074 zika KY241744 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al +SG_056 zika KY241726 2016-08-28 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al +USA/2016/FLUR022 zika KY325473 2016-08-31 North America Usa Usa Usa genbank genome Grubaugh et al +Aedes_aegypti/USA/2016/FL05 zika KY075937 2016-09-09 North America Usa Usa Usa genbank genome Grubaugh et al +SG_018 zika KY241688 2016-09-13 Southeast Asia Singapore Singapore Singapore genbank genome Ho et al +USA/2016/FLWB042 zika KY325478 2016-09-26 North America Usa Usa Usa genbank genome Grubaugh et al +COL/PRV_00028/2015 zika MF574578 2016-12-XX South America Colombia Colombia Colombia genbank genome Pickett et al +Thailand/1610acTw zika MF692778 2016-10-XX Southeast Asia Thailand Thailand Thailand genbank genome Lin et al +1_0087_PF zika KX447509 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al +1_0199_PF zika KX447519 2013-11-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al +1_0181_PF zika KX447512 2013-12-XX Oceania French Polynesia French Polynesia French Polynesia genbank genome Pettersson et al +Brazil/2015/ZBRC301 zika KY558995 2015-05-13 South America Brazil Brazil Brazil genbank genome Faria et al +Brazil/2015/ZBRA105 zika KY558989 2015-02-23 South America Brazil Brazil Brazil genbank genome Faria et al +Brazil/2016/ZBRC16 
zika KY558991 2016-01-19 South America Brazil Brazil Brazil genbank genome Faria et al +V8375 zika KU501217 2015-11-01 North America Guatemala Guatemala Guatemala genbank genome Lanciotti et al +Nica1_16 zika KX421195 2016-01-19 North America Nicaragua Nicaragua Nicaragua genbank genome Tabata et al +Brazil/2015/ZBRC303 zika KY558997 2015-05-14 South America Brazil Brazil Brazil genbank genome Faria et al +SMGC_1 zika KX266255 2016-02-14 Oceania American Samoa American Samoa American Samoa genbank genome Bi et al diff --git a/phylogenetic/example_data/metadata_usvi.tsv b/phylogenetic/example_data/metadata_usvi.tsv new file mode 100644 index 0000000..96d3d52 --- /dev/null +++ b/phylogenetic/example_data/metadata_usvi.tsv @@ -0,0 +1,2 @@ +genbank_accession genbank_accession_rev accession strain date region country division location length host release_date update_date sra_accessions authors institution url +USVI/37/2016 VI37 USVI/37/2016 2016-10-06 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/ diff --git a/phylogenetic/example_data/sequences_usvi.fasta b/phylogenetic/example_data/sequences_usvi.fasta new file mode 100644 index 0000000..d677bfc --- /dev/null +++ b/phylogenetic/example_data/sequences_usvi.fasta @@ -0,0 +1,137 @@ +>VI37 +nnnnnnnnnnnnnnnnnnnnnnnnnnnngacagttcgagtttgaagcgaaagctagcaacagtatcaacaggttttattt +tggatttggaaacgagagtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgctaaa +acgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggacttctgctgggtcatgggcccatca +ggatggtcttggcgattctagcctttttgagattcacggcaatcaagccatcactgggcctcatcaatagatggggttca +gtggggaaaaaagaggctatggaaacaataaagaagttcaagaaagatctggctgccatgctgagaataatcaatgctag +gaaggagaagaagagacgaggcgcagatactagtgtcggaattgttggcctcctgctgaccacagctatggcagcggagg +tcactagacgtgggagtgcatactatatgtacttggacagaaacgatgctggggaggccatatcttttccaaccacattg +gggatgaataagtgttatatacagatcatggatcttggacacatgtgtgatgccaccatgagctatgaatgccctatgct 
+ggatgagggggtggaaccagatgacgtcgattgttggtgcaacacgacgtcaacttgggttgtgtacggaacctgccatc +acaaaaaaggtgaagcacggagatctagaagagctgtgacgctcccctcccattccaccaggaagctgcaaacgcggtcg +caaacctggttggaatcaagagaatacacaaagcacttgattagagtcgaaaattggatattcaggaaccctggcttcgc +gttagcagcagctgccatcgcttggcttttgggaagctcaacgagccaaaaagtcatatacttggtcatgatactgctga +ttgccccggcatacagcatcaggtgcataggagtcagcaatagggactttgtggaaggtatgtcaggtgggacttgggtt +gatgttgtcttggaacatggaggttgtgtcaccgtaatggcacaggacaaaccgactgtcgacatagagctggttacaac +aacagtcagcaacatggcggaggtaagatcctactgctatgaggcatcaatatcagacatggcttctgacagccgctgcc +caacacaaggtgaagcctaccttgacaagcaatcagacactcaatatgtctgcaaaagaacgttagtggacagaggctgg +ggaaatggatgtggactttttggcaaagggagcctggtgacatgcgctaagtttgcatgctccaagaaaatgaccgggaa +gagcatccagccagagaatctggagtaccggataatgctgtcagttcatggctcccagcacagtgggatgatcgttaatg +acacaggacatgaaactgatgagaatagagcgaaagttgagataacgcccaattcaccgagagccgaagccaccctgggg +ggttttggaagcctaggacttgattgtgaaccgaggacaggccttgacttttcagatttgtattacttgactatgaataa +caagcactggttggttcacaaggagtggttccacgacattccattaccttggcacgctggggcagacaccggaactccac +actggaacaacaaagaagcactggtagagttcaaggacgcacatgccaaaaggcaaactgtcgtggttctagggagtcaa +gaaggagcagttcacacggcccttgctggagctctggaggctgagatggatggtgcaaagggaaggctgtcctctggcca +cttgaaatgtcgcctgaaaatggataaacttagattgaagggcgtgtcatactccttgtgtactgcagcgttcacattca +ccaagatcccggctgaaacactgcacgggacagtcacagtggaggtacagtacgcagggacagatggaccttgcaaggtt +ccagctcagatggcggtggacatgcaaactctgaccccagttgggaggttgataaccgctaaccccgtaatcactgaaag +cactgagaactctaagatgatgctggaacttgatccaccatttggggactcttacattgtcataggagtcggggagaaga +agatcacccaccactggcacaggagtggcagcaccattggaaaagcatttgaagccactgtgagaggtgccaagagaatg +gcagtcttgggagacacagcctgggactttggatcagttggaggcgctctcaactcattgggcaagggcatccatcaaat +ttttggagcagctttcaaatcattgtttggaggaatgtcctggttctcacaaattctcattggaacgttgctgatgtggt +tgggtctgaacacaaagaatggatctatttcccttatgtgcttggccttagggggagtgttgatcttcttatccacagcc +gtctctgctgatgtggggtgctcggtggacttctcaaagaaggagacgagatgcggtacaggggtgttcgtctataacga 
+cgttgaagcctggagggacaggtacaagtaccatcctgactccccccgtagattggcagcagcagttaagcaagcctggg +aagatggtatctgcgggatctcctctgtttcaagaatggaaaacatcatgtggagatcagtagaaggggagctcaacgca +atcctggaagagaatggagttcaactgacggtcgttgtgggatctgtaaaaaaccccatgtggagaggtccacagagatt +gcccgtgcctgtgaacgagctgccccacggctggaaggcttgggggaaatcgtacttcgtcagagcagcaaagacaaata +acagctttgtcgtggatggtgacacactgaaggaatgcccactcaaacatagagcatggaacagctttcttgtggaggat +catgggttcggggtatttcacactagtgtctggctcaaggttagagaagattattcattagagtgtgatccagccgttat +tggaacagctgttaagggaaaggaggctgtacacagtgatctaggctactggattgagagtgagaagaatgacacatgga +ggctggagagggcccatctgatcgagatgaaaacatgtgaatggccaaagtcccacacattgtggacagatggaatagaa +gagagtgatctgatcatacccaagtctttagctgggccactcagccatcacaataccagagagggctacaggacccaaat +gaaagggccatggcacagtgaagagcttgaaattcggtttgaggaatgcccaggcactaaggtccacgtggaggaaacat +gtggaacaagaggaccatctctgagatcaaccactgcaagcggaagggtgatcgaggaatggtgctgcagggagtgcaca +atgcccccactgtcgttccgggctaaagatggctgttggtatggaatggagataaggcccaggaaagaaccagaaagcaa +cttagtaaggtcaatggtgactgcaggatcaactgatcacatggaccacttctcccttggagtgcttgtgatcctgctca +tggtgcaggaagggctgaagaagagaatgaccacaaagatcatcataagcacatcaatggcagtgctggtagctatgatc +ctgggaggattttcaatgagtgacctggctaagcttgcaattttgatgggtgccaccttcgcggaaatgaacactggagg +agatgtagctcatctggcgctgatagcggcattcaaagtcagaccagcgttgctggtatctttcatcttcagagctaatt +ggacaccccgtgaaagcatgctgctggccttggcctcgtgtcttttgcaaactgcgatctccgccttggaaggcgacctg +atggttctcatcaatggttttgctttggcctggttggcaatacgagcgatggttgttccacgcactgataacatcacctt +ggcaatcctggctgctctgacaccactggcccggggcacactgcttgtggcgtggagagcaggccttgctacttgcgggg +ggtttatgctcctctctctgaagggaaaaggcagtgtgaagaagaacttaccatttgtcatggccctgggactaaccgct +gtgaggctggtcgaccccatcaacgtggtgggactgctgttgctcacaaggagtgggaagcggagctggccccctagcga +agtactcacagctgttggcctgatatgcgcattggctggagggttcgccaaggcagatatagagatggctgggcccatgg +ccgcggtcggtctgctaattgtcagttacgtggtctcaggaaagagtgtggacatgtacattgaaagagcaggtgacatc +acatgggaaaaagatgcggaagtcactggaaacagtccccggctcgatgtggcgctagatgagagtggtgatttctccct 
+ggtggaggatgacggtccccccatgagagagatcatactcaaggtggtcctgatgaccatctgtggcatgaacccaatag +ccataccctttgcagctggagcgtggtacgtatacgtgaagactggaaaaaggagtggtgctctatgggatgtgcctgct +cccaaggaagtaaaaaagggggagaccacagatggagtgtacagagtaatgactcgtagactgctaggttcaacacaagt +tggagtgggagttatgcaagagggggtctttcacactatgtggcacgtcacaaaaggatccgcgctgagaagcggtgaag +ggagacttgatccatactggggagatgtcaagcaggatctggtgtcatactgtggtccatggaagctagatgccgcctgg +gatgggcacagcgaggtgcagctcttggccgtgccccccggagagagagcgaggaacatccagactctgcccggaatatt +taagacaaaggatggggacattggagcggttgcgctggattacccagcaggaacttcaggatctccaatcctagacaagt +gtgggagagtgataggactttatggcaatggggtcgtgatcaaaaacgggagttatgttagtgccatcacccaagggagg +agggaggaagagactcctgttgagtgcttcgagccctcgatgctgaagaagaagcagctaactgtcttagacttgcatcc +tggagctgggaaaaccaggagagttcttcctgaaatagtccgtgaagccataaaaacaagactccgtactgtgatcttag +ctccaaccagggttgtcgctgctgaaatggaggaggcccttagagggcttccagtgcgttatatgacaacagcagtcaat +gtcacccactctggaacagaaatcgtcgacttaatgtgccatgccaccttcacttcacgtctactacagccaatcagagt +ccccaactataatctgtatattatggatgaggcccacttcacagatccctcaagtatagcagcaagaggatacatttcaa +caagggttgagatgggcgaggcggctgccatcttcatgaccgccacgccaccaggaacccgtgacgcatttccggactcc +aactcaccaattatggacaccgaagtggaagtcccagagagagcctggagctcaggctttgattgggtgacggatcattc +tggaaaaacagtttggtttgttccaagcgtgaggaacggcaatgagatcgcagcttgtctgacaaaggctggaaaacggg +tcatacagctcagcagaaagacttttgagacagagttccagaaaacaaaacatcaagagtgggactttgtcgtgacaact +gacatttcagagatgggcgccaactttaaagctgaccgtgtcatagattccaggagatgcctaaagccggtcatacttga +tggcgagagagtcattctggctggacccatgcctgtcacacatgccagcgctgcccagaggagggggcgcataggcagga +atcccaacaaacctggagatgagtatctgtatggaggtgggtgcgcagagactgacgaagaccatgcacactggcttgaa +gcaagaatgctccttgacaatatttacctccaagatggcctcatagcctcgctctatcgacctgaggccgacaaagtagc +agccattgagggagagttcaagcttaggacggagcaaaggaagacctttgtggaactcatgaaaagaggagatcttcctg +tttggctggcctatcaggttgcatctgccggaataacctacacagatagaagatggtgctttgatggcacgaccaacaac +accataatggaagacagtgtgccggcagaggtgtggaccagacacggagagaaaagagtgctcaaaccgaggtggatgga 
+cgccagagtttgttcagatcatgcggccctgaagtcattcaaggagtttgccgctgggaaaagaggagcggcttttggag +tgatggaagccctgggaacactgccaggacacatgacnnagagattccaggaagcnattgacaacctcgctgtgctcatg +cgngcagagactggaagcaggccttacaaagccgcggcggcccaattgccggagaccctagagaccataatgcntttggg +gttgctgggaacagtctcgctgggaatcttcttcgtcttgatgaggaacaagggcatagggaagatgggctttggaatgg +tgactcttggggccagcgcatggctcatgtggctctcggaaattgagccagccagaattgcatgtgtcctcattgttgtg +ttcctattgctggtggtgctcatacctgagccagaaaagcaaagatctccccaggacaaccaaatggcaatcatcatcat +ggtagcagtaggtcttttgggcttgattaccgccaatgaactcggatggttggagagaacaaagagtgacctaagccatc +taatgggaaggagagaggagggggcaaccataggattctcaatggacattgacctgcggccagcctcagcttgggccatc +tatgctgccttgacaactttcattaccccagccgtccaacatgcagtgaccacctcatacaacaactactccttaatggc +gatggccacgcaagctggagtgttgtttggcatgggcaaagggatgccattctacgcatgggactttggagtcccgctgc +taatgataggttgctactcacaattaacacccctgaccctaatagtggccatcattttgctcgtggcgcactacatgtac +ttgatcccagggctgcaggcagcagctgcgcgtgctgcccagaagagaacggcagctggcatcatgaagaaccctgttgt +ggatggaatagtggtgactgacattgacacaatgacaattgacccccaagtggagaaaaagatgggacaggtgctactca +tagcagtggccgtctccagcgccatactgtcgcggaccgcctgggggtggggggaggctggggctctgatcacagccgca +acttccactttgtgggaaggctctccgaacaagtactggaactcctctacagccacttcactgtgtaacatttttagggg +aagttacttggctggagcttctctaatctacacagtaacaagaaacgctggcttggtcaagagacgtgggggtggaacag +gagagaccctgggagagaaatggaaggcccgcttgaaccagatgtcggccctggagttctactcctacaaaaagtcaggc +atcaccgaggtgtgcagagaagaggcccgccgcgccctcaaggacggtgtggcaacgggaggccatgctgtgtcccgagg +aagtgcaaagctgagatggttggtggagcggggatacctgcagccctatggaaaggtcattgatcttggatgtggcagag +ggggctggagttactacgccgccaccatccgcaaagttcaagaagtgaaaggatacacaaaaggaggccctggtcatgaa +gaacccgtgttggtgcaaagctatgggtggaacatagtccgtcttaagagtggggtggacgtctttcatatggcggctga +gccgtgtgacacgttgctgtgtgacataggtgagtcatcatctagtcctgaagtggaagaagcacggacgctcagagtcc +tctccatggtgggggattggcttgaaaaaagaccaggagccttttgtataaaagtgttgtgcccatacaccagcactatg +atggaaaccctggagcgactgcagcgtaggtatgggggaggactggtcagagtgccactctcccgcaactctacacatga 
+gatgtactgggtctctggagcgaaaagcaacaccataaaaagtgtgtccaccacgagccagctcctcttggggcgcatgg +acgggcctaggaggccagtgaaatatgaggaggatgtgaatctcggctctggcacgcgggctgtggtaagctgcgctgaa +gctcccaacatgaagatcattggtaaccgcattgaaaggatccgcagtgagcacgcggaaacgtggttctttgacgagaa +ccacccatataggacatgggcttaccatggaagctatgaggcccccacacaagggtcagcgtcctctctaataaacgggg +ttgtcaggctcctgtcaaaaccctgggatgtggtgactggagtcacaggaatagccatgaccgacaccacaccgtatggt +cagcaaagagttttcaaggaaaaagtggacactagggtgccagacccccaagaaggcactcgtcaggttatgagcatggt +ctcttcctggttgtggaaagagctaggcaaacacaaacggccacgagtctgcaccaaagaagagttcatcaacaaggttc +gtagcaatgcagcattaggggcaatatttgaggaggaaaaagagtggaagactgcagtggaagctgtgaacgatccaagg +ttctgggctctagtggacaaggaaagagagcaccacctgagaggagagtgccagagctgtgtgtacaacatgatgggaaa +aagagaaaagaaacaaggggaatttggaaaggccaagggcagccgcgccatctggtatatgtggctaggggctagatttc +tagagttcgaagcccttggattcttgaacgaggatcactggatggggagagagaactcaggaggtggtgttgaagggctg +ggattacaaagactcggatatgtcctagaagagatgagtcgtataccaggaggaaggatgtatgcagatgacactgctgg +ctgggacacccgcattagcaggtttgatctggagaatgaagctctaatcaccaaccaaatggagaaagggcacagggcct +tggcattggccataatcaagtacacataccaaaacaaagtggtaaaggtccttagaccagctgaaaaagggaaaacagtt +atggacattatttcgagacaagaccaaagggggagcggacaagttgtcacttacgctcttaacacatttaccaacctagt +ggtgcaactcattcggaatatggaggctgaggaagttctagagatgcaagacttgtggctgctgcggaggtcagagaaag +tgaccaactggttgcagagcaacggatgggataggctcaaacgaatggcagtcagtggagatgattgcgttgtgaagcca +attgatgataggtttgcacatgccctcaggttcttgaatgatatgggaaaagttaggaaggacacacaagagtggaaacc +ctcaactggatgggacaactgggaagaagttccgttttgctcccaccacttcaacaagctccatctcaaggacgggaggt +ccattgtggttccctgccgccaccaagatgaactgattggtcgggcccgcgtctctccaggggcgggatggagcatccgg +gagactgcttgcctagcaaaatcatatgcgcaaatgtggcagctcctttatttccacagaagggacctccgactgatggc +caatgccatttgttcatctgtgccagttgactgggttccaactgggagaactacctggtcaatccatggaaagggagaat +ggatgaccactgaagacatgcttgtggtgtggaacagagtgtggatnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn 
+nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn +nnnnnnn diff --git a/phylogenetic/rules/usvi.smk b/phylogenetic/rules/usvi.smk new file mode 100644 index 0000000..5ae8b3d --- /dev/null +++ b/phylogenetic/rules/usvi.smk @@ -0,0 +1,52 @@ +rule download_usvi: + """Downloading sequences and metadata from data.nextstrain.org""" + output: + sequences = "data/sequences_usvi.fasta.zst", + metadata = "data/metadata_usvi.tsv.zst" + params: + sequences_url = "https://data.nextstrain.org/files/zika/sequences_usvi.fasta.zst", + metadata_url = "https://data.nextstrain.org/files/zika/metadata_usvi.tsv.zst" + shell: + """ + curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences} + curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata} + """ + +rule decompress_usvi: + """Decompressing sequences and metadata""" + input: + sequences = "data/sequences_usvi.fasta.zst", + metadata = "data/metadata_usvi.tsv.zst" + output: + sequences = "data/sequences_usvi.fasta", + metadata = "data/metadata_usvi.tsv" + shell: + """ + zstd -d -c {input.sequences} > {output.sequences} + zstd -d -c {input.metadata} > {output.metadata} + """ + +rule append_usvi: + """Appending USVI sequences""" + input: + sequences = "data/sequences.fasta", + metadata = "data/metadata.tsv", + usvi_sequences = "data/sequences_usvi.fasta", + usvi_metadata = "data/metadata_usvi.tsv" + output: + sequences = "data/sequences_all.fasta", + metadata = "data/metadata_all.tsv" 
+ shell: + """ + cat {input.sequences} {input.usvi_sequences} > {output.sequences} + + csvtk mutate2 -tl \ + -n url \ + -e '"https://www.ncbi.nlm.nih.gov/nuccore/" + $genbank_accession' \ + {input.metadata} \ + | csvtk mutate2 -tl \ + -n accession \ + -e '$genbank_accession' \ + | csvtk concat -tl - {input.usvi_metadata} \ + > {output.metadata} + """ \ No newline at end of file diff --git a/phylogenetic/scripts/set_final_strain_name.py b/phylogenetic/scripts/set_final_strain_name.py index c670f44..d104ca1 100644 --- a/phylogenetic/scripts/set_final_strain_name.py +++ b/phylogenetic/scripts/set_final_strain_name.py @@ -6,7 +6,6 @@ def replace_name_recursive(node, lookup, saveoldcolumn): if node["name"] in lookup: if saveoldcolumn == "accession": node["node_attrs"][saveoldcolumn] = node["name"] - node["node_attrs"]["url"] = "https://www.ncbi.nlm.nih.gov/nuccore/" + node["name"] elif saveoldcolumn == "genbank_accession": node["node_attrs"][saveoldcolumn] = {} node["node_attrs"][saveoldcolumn]["value"] = node["name"] From 11123b92889dbd3e0f79824ce41b5da8d908fa6b Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 18 Dec 2023 11:40:59 -0800 Subject: [PATCH 13/28] Move rules for preparing sequences to its own smk file Part of work to update this repo to match the pathogen-repo-template. 
--- phylogenetic/Snakefile | 83 +----------------- phylogenetic/rules/prepare_sequences.smk | 104 +++++++++++++++++++++++ 2 files changed, 105 insertions(+), 82 deletions(-) create mode 100644 phylogenetic/rules/prepare_sequences.smk diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index ca29a68..5c73580 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -17,88 +17,7 @@ rule files: files = rules.files.params include: "rules/usvi.smk" - -rule download: - """Downloading sequences and metadata from data.nextstrain.org""" - output: - sequences = "data/sequences.fasta.zst", - metadata = "data/metadata.tsv.zst" - params: - sequences_url = "https://data.nextstrain.org/files/zika/sequences.fasta.zst", - metadata_url = "https://data.nextstrain.org/files/zika/metadata.tsv.zst" - shell: - """ - curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences} - curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata} - """ - -rule decompress: - """Decompressing sequences and metadata""" - input: - sequences = "data/sequences.fasta.zst", - metadata = "data/metadata.tsv.zst" - output: - sequences = "data/sequences.fasta", - metadata = "data/metadata.tsv" - shell: - """ - zstd -d -c {input.sequences} > {output.sequences} - zstd -d -c {input.metadata} > {output.metadata} - """ - -rule filter: - """ - Filtering to - - {params.sequences_per_group} sequence(s) per {params.group_by!s} - - from {params.min_date} onwards - - excluding strains in {input.exclude} - - minimum genome length of {params.min_length} (50% of Zika virus genome) - """ - input: - sequences = "data/sequences_all.fasta", - metadata = "data/metadata_all.tsv", - exclude = files.dropped_strains - output: - sequences = "results/filtered.fasta" - params: - group_by = "country year month", - sequences_per_group = 40, - min_date = 2012, - min_length = 5385, - strain_id = config.get("strain_id_field", "strain"), - shell: - """ - augur filter \ - --sequences 
{input.sequences} \ - --metadata {input.metadata} \ - --metadata-id-columns {params.strain_id} \ - --exclude {input.exclude} \ - --output {output.sequences} \ - --group-by {params.group_by} \ - --sequences-per-group {params.sequences_per_group} \ - --min-date {params.min_date} \ - --min-length {params.min_length} - """ - -rule align: - """ - Aligning sequences to {input.reference} - - filling gaps with N - """ - input: - sequences = "results/filtered.fasta", - reference = files.reference - output: - alignment = "results/aligned.fasta" - shell: - """ - augur align \ - --sequences {input.sequences} \ - --reference-sequence {input.reference} \ - --output {output.alignment} \ - --fill-gaps \ - --remove-reference - """ +include: "rules/prepare_sequences.smk" rule tree: """Building tree""" diff --git a/phylogenetic/rules/prepare_sequences.smk b/phylogenetic/rules/prepare_sequences.smk new file mode 100644 index 0000000..255b87b --- /dev/null +++ b/phylogenetic/rules/prepare_sequences.smk @@ -0,0 +1,104 @@ +""" +This part of the workflow prepares sequences for constructing the phylogenetic tree. + +REQUIRED INPUTS: + + metadata_url = url to metadata.tsv.zst + sequences_url = url to sequences.fasta.zst + reference = path to reference sequence or genbank + +OUTPUTS: + + prepared_sequences = results/aligned.fasta + +This part of the workflow usually includes the following steps: + + - augur index + - augur filter + - augur align + - augur mask + +See Augur's usage docs for these commands for more details. 
+""" + +rule download: + """Downloading sequences and metadata from data.nextstrain.org""" + output: + sequences = "data/sequences.fasta.zst", + metadata = "data/metadata.tsv.zst" + params: + sequences_url = "https://data.nextstrain.org/files/zika/sequences.fasta.zst", + metadata_url = "https://data.nextstrain.org/files/zika/metadata.tsv.zst" + shell: + """ + curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences} + curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata} + """ + +rule decompress: + """Decompressing sequences and metadata""" + input: + sequences = "data/sequences.fasta.zst", + metadata = "data/metadata.tsv.zst" + output: + sequences = "data/sequences.fasta", + metadata = "data/metadata.tsv" + shell: + """ + zstd -d -c {input.sequences} > {output.sequences} + zstd -d -c {input.metadata} > {output.metadata} + """ + +rule filter: + """ + Filtering to + - {params.sequences_per_group} sequence(s) per {params.group_by!s} + - from {params.min_date} onwards + - excluding strains in {input.exclude} + - minimum genome length of {params.min_length} (50% of Zika virus genome) + """ + input: + sequences = "data/sequences_all.fasta", + metadata = "data/metadata_all.tsv", + exclude = files.dropped_strains + output: + sequences = "results/filtered.fasta" + params: + group_by = "country year month", + sequences_per_group = 40, + min_date = 2012, + min_length = 5385, + strain_id = config.get("strain_id_field", "strain"), + shell: + """ + augur filter \ + --sequences {input.sequences} \ + --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ + --exclude {input.exclude} \ + --output {output.sequences} \ + --group-by {params.group_by} \ + --sequences-per-group {params.sequences_per_group} \ + --min-date {params.min_date} \ + --min-length {params.min_length} + """ + +rule align: + """ + Aligning sequences to {input.reference} + - filling gaps with N + """ + input: + sequences = "results/filtered.fasta", + 
reference = files.reference + output: + alignment = "results/aligned.fasta" + shell: + """ + augur align \ + --sequences {input.sequences} \ + --reference-sequence {input.reference} \ + --output {output.alignment} \ + --fill-gaps \ + --remove-reference + """ \ No newline at end of file From 59ef9266231986b0021c10f97cc36135dd2277b6 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 18 Dec 2023 11:51:27 -0800 Subject: [PATCH 14/28] Move rules for constructing phylogeny to its own smk file Part of work to update this repo to match the pathogen-repo-template. --- phylogenetic/Snakefile | 50 +--------------- phylogenetic/rules/construct_phylogeny.smk | 69 ++++++++++++++++++++++ 2 files changed, 70 insertions(+), 49 deletions(-) create mode 100644 phylogenetic/rules/construct_phylogeny.smk diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index 5c73580..091e3aa 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -18,55 +18,7 @@ files = rules.files.params include: "rules/usvi.smk" include: "rules/prepare_sequences.smk" - -rule tree: - """Building tree""" - input: - alignment = "results/aligned.fasta" - output: - tree = "results/tree_raw.nwk" - shell: - """ - augur tree \ - --alignment {input.alignment} \ - --output {output.tree} - """ - -rule refine: - """ - Refining tree - - estimate timetree - - use {params.coalescent} coalescent timescale - - estimate {params.date_inference} node dates - - filter tips more than {params.clock_filter_iqd} IQDs from clock expectation - """ - input: - tree = "results/tree_raw.nwk", - alignment = "results/aligned.fasta", - metadata = "data/metadata_all.tsv" - output: - tree = "results/tree.nwk", - node_data = "results/branch_lengths.json" - params: - coalescent = "opt", - date_inference = "marginal", - clock_filter_iqd = 4, - strain_id = config.get("strain_id_field", "strain"), - shell: - """ - augur refine \ - --tree {input.tree} \ - --alignment {input.alignment} \ - --metadata {input.metadata} \ - 
--metadata-id-columns {params.strain_id} \ - --output-tree {output.tree} \ - --output-node-data {output.node_data} \ - --timetree \ - --coalescent {params.coalescent} \ - --date-confidence \ - --date-inference {params.date_inference} \ - --clock-filter-iqd {params.clock_filter_iqd} - """ +include: "rules/construct_phylogeny.smk" rule ancestral: """Reconstructing ancestral sequences and mutations""" diff --git a/phylogenetic/rules/construct_phylogeny.smk b/phylogenetic/rules/construct_phylogeny.smk new file mode 100644 index 0000000..efc5ff6 --- /dev/null +++ b/phylogenetic/rules/construct_phylogeny.smk @@ -0,0 +1,69 @@ +""" +This part of the workflow constructs the phylogenetic tree. + +REQUIRED INPUTS: + + metadata = data/metadata_all.tsv + prepared_sequences = results/aligned.fasta + +OUTPUTS: + + tree = results/tree.nwk + branch_lengths = results/branch_lengths.json + +This part of the workflow usually includes the following steps: + + - augur tree + - augur refine + +See Augur's usage docs for these commands for more details. 
+""" + +rule tree: + """Building tree""" + input: + alignment = "results/aligned.fasta" + output: + tree = "results/tree_raw.nwk" + shell: + """ + augur tree \ + --alignment {input.alignment} \ + --output {output.tree} + """ + +rule refine: + """ + Refining tree + - estimate timetree + - use {params.coalescent} coalescent timescale + - estimate {params.date_inference} node dates + - filter tips more than {params.clock_filter_iqd} IQDs from clock expectation + """ + input: + tree = "results/tree_raw.nwk", + alignment = "results/aligned.fasta", + metadata = "data/metadata_all.tsv" + output: + tree = "results/tree.nwk", + node_data = "results/branch_lengths.json" + params: + coalescent = "opt", + date_inference = "marginal", + clock_filter_iqd = 4, + strain_id = config.get("strain_id_field", "strain"), + shell: + """ + augur refine \ + --tree {input.tree} \ + --alignment {input.alignment} \ + --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ + --output-tree {output.tree} \ + --output-node-data {output.node_data} \ + --timetree \ + --coalescent {params.coalescent} \ + --date-confidence \ + --date-inference {params.date_inference} \ + --clock-filter-iqd {params.clock_filter_iqd} + """ From 446764668189c81198151a45e8af66a96e7c7086 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 18 Dec 2023 11:52:58 -0800 Subject: [PATCH 15/28] Move rules for annotating phylogeny to its own smk file Part of work to update this repo to match the pathogen-repo-template. 
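One detail worth noting for this refactor: Snakemake's `include:` directive executes the included file in the main workflow's global namespace, so rules moved out to `rules/*.smk` can keep referencing globals defined in the Snakefile, such as the `config` dict and the `files` params object. A minimal sketch of the pattern (illustrative names, not the verbatim zika files):

```snakemake
# Snakefile (sketch)
configfile: "config/defaults.yaml"

rule files:
    params:
        reference = "config/reference.gb"

files = rules.files.params

include: "rules/moved_rules.smk"
# Rules in rules/moved_rules.smk can now use `files.reference` and
# `config.get("strain_id_field", "strain")` directly, with no imports needed.
```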
--- phylogenetic/Snakefile | 62 +-------------- phylogenetic/rules/annotate_phylogeny.smk | 93 +++++++++++++++++++++++ 2 files changed, 94 insertions(+), 61 deletions(-) create mode 100644 phylogenetic/rules/annotate_phylogeny.smk diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index 091e3aa..d31d4e7 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -19,67 +19,7 @@ files = rules.files.params include: "rules/usvi.smk" include: "rules/prepare_sequences.smk" include: "rules/construct_phylogeny.smk" - -rule ancestral: - """Reconstructing ancestral sequences and mutations""" - input: - tree = "results/tree.nwk", - alignment = "results/aligned.fasta" - output: - node_data = "results/nt_muts.json" - params: - inference = "joint" - shell: - """ - augur ancestral \ - --tree {input.tree} \ - --alignment {input.alignment} \ - --output-node-data {output.node_data} \ - --inference {params.inference} - """ - -rule translate: - """Translating amino acid sequences""" - input: - tree = "results/tree.nwk", - node_data = "results/nt_muts.json", - reference = files.reference - output: - node_data = "results/aa_muts.json" - shell: - """ - augur translate \ - --tree {input.tree} \ - --ancestral-sequences {input.node_data} \ - --reference-sequence {input.reference} \ - --output {output.node_data} \ - """ - -rule traits: - """ - Inferring ancestral traits for {params.columns!s} - - increase uncertainty of reconstruction by {params.sampling_bias_correction} to partially account for sampling bias - """ - input: - tree = "results/tree.nwk", - metadata = "data/metadata_all.tsv" - output: - node_data = "results/traits.json", - params: - columns = "region country", - sampling_bias_correction = 3, - strain_id = config.get("strain_id_field", "strain"), - shell: - """ - augur traits \ - --tree {input.tree} \ - --metadata {input.metadata} \ - --metadata-id-columns {params.strain_id} \ - --output {output.node_data} \ - --columns {params.columns} \ - --confidence \ - 
--sampling-bias-correction {params.sampling_bias_correction} - """ +include: "rules/annotate_phylogeny.smk" rule export: """Exporting data files for for auspice""" diff --git a/phylogenetic/rules/annotate_phylogeny.smk b/phylogenetic/rules/annotate_phylogeny.smk new file mode 100644 index 0000000..257915c --- /dev/null +++ b/phylogenetic/rules/annotate_phylogeny.smk @@ -0,0 +1,93 @@ +""" +This part of the workflow creates additional annotations for the phylogenetic tree. + +REQUIRED INPUTS: + + metadata = data/metadata_all.tsv + prepared_sequences = results/aligned.fasta + tree = results/tree.nwk + +OUTPUTS: + + node_data = results/*.json + + There are no required outputs for this part of the workflow as it depends + on which annotations are created. All outputs are expected to be node data + JSON files that can be fed into `augur export`. + + See Nextstrain's data format docs for more details on node data JSONs: + https://docs.nextstrain.org/page/reference/data-formats.html + +This part of the workflow usually includes the following steps: + + - augur traits + - augur ancestral + - augur translate + - augur clades + +See Augur's usage docs for these commands for more details. + +Custom node data files can also be produced by build-specific scripts in addition +to the ones produced by Augur commands.
+""" + +rule ancestral: + """Reconstructing ancestral sequences and mutations""" + input: + tree = "results/tree.nwk", + alignment = "results/aligned.fasta" + output: + node_data = "results/nt_muts.json" + params: + inference = "joint" + shell: + """ + augur ancestral \ + --tree {input.tree} \ + --alignment {input.alignment} \ + --output-node-data {output.node_data} \ + --inference {params.inference} + """ + +rule translate: + """Translating amino acid sequences""" + input: + tree = "results/tree.nwk", + node_data = "results/nt_muts.json", + reference = files.reference + output: + node_data = "results/aa_muts.json" + shell: + """ + augur translate \ + --tree {input.tree} \ + --ancestral-sequences {input.node_data} \ + --reference-sequence {input.reference} \ + --output {output.node_data} + """ + +rule traits: + """ + Inferring ancestral traits for {params.columns!s} + - increase uncertainty of reconstruction by {params.sampling_bias_correction} to partially account for sampling bias + """ + input: + tree = "results/tree.nwk", + metadata = "data/metadata_all.tsv" + output: + node_data = "results/traits.json", + params: + columns = "region country", + sampling_bias_correction = 3, + strain_id = config.get("strain_id_field", "strain"), + shell: + """ + augur traits \ + --tree {input.tree} \ + --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ + --output {output.node_data} \ + --columns {params.columns} \ + --confidence \ + --sampling-bias-correction {params.sampling_bias_correction} + """ From 697694697c0deaca71544eab6cd5abb735475e50 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 18 Dec 2023 11:56:00 -0800 Subject: [PATCH 16/28] Move rules for exporting auspice json to its own smk file Part of work to update this repo to match the pathogen-repo-template.
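With sequence preparation, tree construction, annotation, and export each in their own rules file, the phylogenetic Snakefile reduces to little more than a target rule and a list of includes (plus the remaining `files` and `clean` rules). Roughly, the expected end state after this patch, sketched rather than quoted verbatim:

```snakemake
# phylogenetic/Snakefile after patches 14-16 (sketch)
rule all:
    input:
        auspice_json = "auspice/zika.json",

include: "rules/usvi.smk"
include: "rules/prepare_sequences.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"
```

Because Snakemake resolves targets across included files, `nextstrain build .` still produces `auspice/zika.json` exactly as before the split.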
--- phylogenetic/Snakefile | 55 +----------------------- phylogenetic/rules/export.smk | 80 +++++++++++++++++++++++++++++++++++ 2 files changed, 81 insertions(+), 54 deletions(-) create mode 100644 phylogenetic/rules/export.smk diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index d31d4e7..d599360 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -20,60 +20,7 @@ include: "rules/usvi.smk" include: "rules/prepare_sequences.smk" include: "rules/construct_phylogeny.smk" include: "rules/annotate_phylogeny.smk" - -rule export: - """Exporting data files for for auspice""" - input: - tree = "results/tree.nwk", - metadata = "data/metadata_all.tsv", - branch_lengths = "results/branch_lengths.json", - traits = "results/traits.json", - nt_muts = "results/nt_muts.json", - aa_muts = "results/aa_muts.json", - colors = files.colors, - auspice_config = files.auspice_config, - description = files.description - output: - auspice_json = "results/raw_zika.json", - root_sequence = "results/raw_zika_root-sequence.json", - params: - strain_id = config.get("strain_id_field", "strain"), - shell: - """ - augur export v2 \ - --tree {input.tree} \ - --metadata {input.metadata} \ - --metadata-id-columns {params.strain_id} \ - --node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} \ - --colors {input.colors} \ - --auspice-config {input.auspice_config} \ - --description {input.description} \ - --include-root-sequence \ - --output {output.auspice_json} - """ - -rule final_strain_name: - input: - auspice_json="results/raw_zika.json", - metadata="data/metadata_all.tsv", - root_sequence="results/raw_zika_root-sequence.json", - output: - auspice_json="auspice/zika.json", - root_sequence="auspice/zika_root-sequence.json", - params: - strain_id=config["strain_id_field"], - display_strain_field=config.get("display_strain_field", "strain"), - shell: - """ - python3 scripts/set_final_strain_name.py \ - --metadata {input.metadata} \ - 
--metadata-id-columns {params.strain_id} \ - --input-auspice-json {input.auspice_json} \ - --display-strain-name {params.display_strain_field} \ - --output {output.auspice_json} - - cp {input.root_sequence} {output.root_sequence} - """ +include: "rules/export.smk" rule clean: """Removing directories: {params}""" diff --git a/phylogenetic/rules/export.smk b/phylogenetic/rules/export.smk new file mode 100644 index 0000000..7dbe431 --- /dev/null +++ b/phylogenetic/rules/export.smk @@ -0,0 +1,80 @@ +""" +This part of the workflow collects the phylogenetic tree and annotations to +export a Nextstrain dataset. + +REQUIRED INPUTS: + + metadata = data/metadata_all.tsv + tree = results/tree.nwk + branch_lengths = results/branch_lengths.json + node_data = results/*.json + +OUTPUTS: + + auspice_json = auspice/${build_name}.json + + There are optional sidecar JSON files that can be exported as part of the dataset. + See Nextstrain's data format docs for more details on sidecar files: + https://docs.nextstrain.org/page/reference/data-formats.html + +This part of the workflow usually includes the following steps: + + - augur export v2 + - augur frequencies + +See Augur's usage docs for these commands for more details. 
+""" + +rule export: + """Exporting data files for auspice""" + input: + tree = "results/tree.nwk", + metadata = "data/metadata_all.tsv", + branch_lengths = "results/branch_lengths.json", + traits = "results/traits.json", + nt_muts = "results/nt_muts.json", + aa_muts = "results/aa_muts.json", + colors = files.colors, + auspice_config = files.auspice_config, + description = files.description + output: + auspice_json = "results/raw_zika.json", + root_sequence = "results/raw_zika_root-sequence.json", + params: + strain_id = config.get("strain_id_field", "strain"), + shell: + """ + augur export v2 \ + --tree {input.tree} \ + --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ + --node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} \ + --colors {input.colors} \ + --auspice-config {input.auspice_config} \ + --description {input.description} \ + --include-root-sequence \ + --output {output.auspice_json} + """ + +rule final_strain_name: + input: + auspice_json="results/raw_zika.json", + metadata="data/metadata_all.tsv", + root_sequence="results/raw_zika_root-sequence.json", + output: + auspice_json="auspice/zika.json", + root_sequence="auspice/zika_root-sequence.json", + params: + strain_id=config["strain_id_field"], + display_strain_field=config.get("display_strain_field", "strain"), + shell: + """ + python3 scripts/set_final_strain_name.py \ + --metadata {input.metadata} \ + --metadata-id-columns {params.strain_id} \ + --input-auspice-json {input.auspice_json} \ + --display-strain-name {params.display_strain_field} \ + --output {output.auspice_json} + + cp {input.root_sequence} {output.root_sequence} + """ From 18e0b9b989eff82f9ac076b32eab4665d8694626 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 18 Dec 2023 14:50:23 -0800 Subject: [PATCH 17/28] ingest: consolidate source-data with config Part of work to update this repo to match the pathogen-repo-template.
We are removing the `source-data` directory and consolidating everything into the config directory since there's really no functional difference between them. https://github.com/nextstrain/mpox/commit/c34f370cc5b305c089032b9714c7adeecc9cec94 --- ingest/{source-data => config}/annotations.tsv | 0 ingest/config/config.yaml | 4 ++-- ingest/{source-data => config}/geolocation-rules.tsv | 0 3 files changed, 2 insertions(+), 2 deletions(-) rename ingest/{source-data => config}/annotations.tsv (100%) rename ingest/{source-data => config}/geolocation-rules.tsv (100%) diff --git a/ingest/source-data/annotations.tsv b/ingest/config/annotations.tsv similarity index 100% rename from ingest/source-data/annotations.tsv rename to ingest/config/annotations.tsv diff --git a/ingest/config/config.yaml b/ingest/config/config.yaml index 927bd75..e7bd353 100644 --- a/ingest/config/config.yaml +++ b/ingest/config/config.yaml @@ -76,9 +76,9 @@ transform: geolocation_rules_url: 'https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv' # Local geolocation rules that are only applicable to zika data # Local rules can overwrite the general geolocation rules provided above - local_geolocation_rules: 'source-data/geolocation-rules.tsv' + local_geolocation_rules: 'config/geolocation-rules.tsv' # User annotations file - annotations: 'source-data/annotations.tsv' + annotations: 'config/annotations.tsv' # ID field used to merge annotations annotations_id: 'genbank_accession' # Field to use as the sequence ID in the FASTA file diff --git a/ingest/source-data/geolocation-rules.tsv b/ingest/config/geolocation-rules.tsv similarity index 100% rename from ingest/source-data/geolocation-rules.tsv rename to ingest/config/geolocation-rules.tsv From 3124b187c4679d35554f5d9a780a65365571ceb8 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Mon, 18 Dec 2023 14:58:00 -0800 Subject: [PATCH 18/28] ingest: Always provide default config values Part of work to update 
this repo to match the pathogen-repo-template. Always provide default config values that can then easily be overridden with --configfiles/--config options. This makes it simpler to change a subset of config values or extend the default configs. Renames the config.yaml to defaults.yaml to reflect this change. https://github.com/nextstrain/mpox/commit/1e01fedb5fa965124dabb0cafef4aa94b298c44f --- ingest/Snakefile | 4 ++-- ingest/config/{config.yaml => defaults.yaml} | 0 2 files changed, 2 insertions(+), 2 deletions(-) rename ingest/config/{config.yaml => defaults.yaml} (100%) diff --git a/ingest/Snakefile b/ingest/Snakefile index bfc99b0..56848c4 100644 --- a/ingest/Snakefile +++ b/ingest/Snakefile @@ -5,8 +5,8 @@ min_version( ) # Snakemake 7.7.0 introduced `retries` directive used in fetch-sequences if not config: - - configfile: "config/config.yaml" + # Use default configuration values. Override with Snakemake's --configfile/--config options. + configfile: "config/defaults.yaml" send_slack_notifications = config.get("send_slack_notifications", False) diff --git a/ingest/config/config.yaml b/ingest/config/defaults.yaml similarity index 100% rename from ingest/config/config.yaml rename to ingest/config/defaults.yaml From f388525eb3ccf90b02d90b6da702425d3287209d Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Wed, 10 Jan 2024 15:20:31 -0800 Subject: [PATCH 19/28] Use a rules folder for ingest to follow pathogen-repo-template --- ingest/Snakefile | 10 +++++----- .../snakemake_rules => rules}/fetch_sequences.smk | 0 .../snakemake_rules => rules}/slack_notifications.smk | 0 .../{workflow/snakemake_rules => rules}/transform.smk | 0 .../snakemake_rules => rules}/trigger_rebuild.smk | 0 ingest/{workflow/snakemake_rules => rules}/upload.smk | 0 6 files changed, 5 insertions(+), 5 deletions(-) rename ingest/{workflow/snakemake_rules => rules}/fetch_sequences.smk (100%) rename ingest/{workflow/snakemake_rules => rules}/slack_notifications.smk (100%) rename 
ingest/{workflow/snakemake_rules => rules}/transform.smk (100%) rename ingest/{workflow/snakemake_rules => rules}/trigger_rebuild.smk (100%) rename ingest/{workflow/snakemake_rules => rules}/upload.smk (100%) diff --git a/ingest/Snakefile b/ingest/Snakefile index 56848c4..453a327 100644 --- a/ingest/Snakefile +++ b/ingest/Snakefile @@ -55,20 +55,20 @@ rule all: _get_all_targets, -include: "workflow/snakemake_rules/fetch_sequences.smk" -include: "workflow/snakemake_rules/transform.smk" +include: "rules/fetch_sequences.smk" +include: "rules/transform.smk" if config.get("upload", False): - include: "workflow/snakemake_rules/upload.smk" + include: "rules/upload.smk" if send_slack_notifications: - include: "workflow/snakemake_rules/slack_notifications.smk" + include: "rules/slack_notifications.smk" if config.get("trigger_rebuild", False): - include: "workflow/snakemake_rules/trigger_rebuild.smk" + include: "rules/trigger_rebuild.smk" diff --git a/ingest/workflow/snakemake_rules/fetch_sequences.smk b/ingest/rules/fetch_sequences.smk similarity index 100% rename from ingest/workflow/snakemake_rules/fetch_sequences.smk rename to ingest/rules/fetch_sequences.smk diff --git a/ingest/workflow/snakemake_rules/slack_notifications.smk b/ingest/rules/slack_notifications.smk similarity index 100% rename from ingest/workflow/snakemake_rules/slack_notifications.smk rename to ingest/rules/slack_notifications.smk diff --git a/ingest/workflow/snakemake_rules/transform.smk b/ingest/rules/transform.smk similarity index 100% rename from ingest/workflow/snakemake_rules/transform.smk rename to ingest/rules/transform.smk diff --git a/ingest/workflow/snakemake_rules/trigger_rebuild.smk b/ingest/rules/trigger_rebuild.smk similarity index 100% rename from ingest/workflow/snakemake_rules/trigger_rebuild.smk rename to ingest/rules/trigger_rebuild.smk diff --git a/ingest/workflow/snakemake_rules/upload.smk b/ingest/rules/upload.smk similarity index 100% rename from 
ingest/workflow/snakemake_rules/upload.smk rename to ingest/rules/upload.smk From efe11e381ca777ac7e784d7d04cce0d04fd6fbfe Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Wed, 10 Jan 2024 15:31:37 -0800 Subject: [PATCH 20/28] Update the CI workflow Use the stopgap from mpox (nextstrain/mpox#214) until the pathogen-repo-ci is updated. However, we may be moving to using config/customization/ci in the future, so revisit this commit and edit accordingly. https://github.com/nextstrain/pathogen-repo-template/pull/24 --- .github/workflows/ci.yaml | 23 ++++++++++++++++--- phylogenetic/Snakefile | 9 ++++++-- .../profiles/ci/copy_example_data.smk | 21 +++++++++++++++++ phylogenetic/profiles/ci/profiles_config.yaml | 2 ++ 4 files changed, 50 insertions(+), 5 deletions(-) create mode 100644 phylogenetic/profiles/ci/copy_example_data.smk create mode 100644 phylogenetic/profiles/ci/profiles_config.yaml diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml index 6fa98cd..12ce718 100644 --- a/.github/workflows/ci.yaml +++ b/.github/workflows/ci.yaml @@ -5,7 +5,24 @@ on: - pull_request jobs: - ci: - uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@dec0880059017dac7facf100435c5737bf1386c8 + pathogen-ci: + strategy: + matrix: + runtime: [docker, conda] + permissions: + id-token: write + uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master + secrets: inherit with: - workflow-root: phylogenetic + runtime: ${{ matrix.runtime }} + run: | + nextstrain build \ + phylogenetic \ + --configfile profiles/ci/profiles_config.yaml + artifact-name: output-${{ matrix.runtime }} + artifact-paths: | + phylogenetic/auspice/ + phylogenetic/results/ + phylogenetic/benchmarks/ + phylogenetic/logs/ + phylogenetic/.snakemake/log/ \ No newline at end of file diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index d599360..30c4777 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -1,5 +1,4 @@ -if not config: - configfile: 
"config/config_zika.yaml" +configfile: "config/config_zika.yaml" rule all: input: @@ -22,6 +21,12 @@ include: "rules/construct_phylogeny.smk" include: "rules/annotate_phylogeny.smk" include: "rules/export.smk" +# Include custom rules defined in the config. +if "custom_rules" in config: + for rule_file in config["custom_rules"]: + + include: rule_file + rule clean: """Removing directories: {params}""" params: diff --git a/phylogenetic/profiles/ci/copy_example_data.smk b/phylogenetic/profiles/ci/copy_example_data.smk new file mode 100644 index 0000000..495d778 --- /dev/null +++ b/phylogenetic/profiles/ci/copy_example_data.smk @@ -0,0 +1,21 @@ +rule copy_example_data: + input: + sequences="example_data/sequences.fasta", + metadata="example_data/metadata.tsv", + usvi_sequences="example_data/sequences_usvi.fasta", + usvi_metadata="example_data/metadata_usvi.tsv", + output: + sequences="data/sequences.fasta", + metadata="data/metadata.tsv", + usvi_sequences="data/sequences_usvi.fasta", + usvi_metadata="data/metadata_usvi.tsv", + shell: + """ + cp -f {input.sequences} {output.sequences} + cp -f {input.metadata} {output.metadata} + cp -f {input.usvi_sequences} {output.usvi_sequences} + cp -f {input.usvi_metadata} {output.usvi_metadata} + """ + +ruleorder: copy_example_data > decompress +ruleorder: copy_example_data > decompress_usvi \ No newline at end of file diff --git a/phylogenetic/profiles/ci/profiles_config.yaml b/phylogenetic/profiles/ci/profiles_config.yaml new file mode 100644 index 0000000..17bad21 --- /dev/null +++ b/phylogenetic/profiles/ci/profiles_config.yaml @@ -0,0 +1,2 @@ +custom_rules: + - profiles/ci/copy_example_data.smk From 6349fd76b4b9908159eb3934b96e68ea0cf0878e Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Wed, 10 Jan 2024 16:05:54 -0800 Subject: [PATCH 21/28] Define input paths with literal path strings https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#define-input-paths-with-literal-path-strings --- 
phylogenetic/Snakefile | 11 ----------- phylogenetic/rules/annotate_phylogeny.smk | 2 +- phylogenetic/rules/export.smk | 6 +++--- phylogenetic/rules/prepare_sequences.smk | 4 ++-- 4 files changed, 6 insertions(+), 17 deletions(-) diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index 30c4777..d40f00e 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -4,17 +4,6 @@ rule all: input: auspice_json = "auspice/zika.json", -rule files: - params: - input_fasta = "data/zika.fasta", - dropped_strains = "config/dropped_strains.txt", - reference = "config/zika_reference.gb", - colors = "config/colors.tsv", - auspice_config = "config/auspice_config.json", - description = "config/description.md" - -files = rules.files.params - include: "rules/usvi.smk" include: "rules/prepare_sequences.smk" include: "rules/construct_phylogeny.smk" diff --git a/phylogenetic/rules/annotate_phylogeny.smk b/phylogenetic/rules/annotate_phylogeny.smk index 257915c..ad00d0f 100644 --- a/phylogenetic/rules/annotate_phylogeny.smk +++ b/phylogenetic/rules/annotate_phylogeny.smk @@ -54,7 +54,7 @@ rule translate: input: tree = "results/tree.nwk", node_data = "results/nt_muts.json", - reference = files.reference + reference = "config/zika_reference.gb" output: node_data = "results/aa_muts.json" shell: diff --git a/phylogenetic/rules/export.smk b/phylogenetic/rules/export.smk index 7dbe431..e44b8d5 100644 --- a/phylogenetic/rules/export.smk +++ b/phylogenetic/rules/export.smk @@ -34,9 +34,9 @@ rule export: traits = "results/traits.json", nt_muts = "results/nt_muts.json", aa_muts = "results/aa_muts.json", - colors = files.colors, - auspice_config = files.auspice_config, - description = files.description + colors = "config/colors.tsv", + auspice_config = "config/auspice_config.json", + description = "config/description.md" output: auspice_json = "results/raw_zika.json", root_sequence = "results/raw_zika_root-sequence.json", diff --git a/phylogenetic/rules/prepare_sequences.smk 
b/phylogenetic/rules/prepare_sequences.smk index 255b87b..2a11420 100644 --- a/phylogenetic/rules/prepare_sequences.smk +++ b/phylogenetic/rules/prepare_sequences.smk @@ -60,7 +60,7 @@ rule filter: input: sequences = "data/sequences_all.fasta", metadata = "data/metadata_all.tsv", - exclude = files.dropped_strains + exclude = "config/dropped_strains.txt", output: sequences = "results/filtered.fasta" params: @@ -90,7 +90,7 @@ rule align: """ input: sequences = "results/filtered.fasta", - reference = files.reference + reference = "config/zika_reference.gb" output: alignment = "results/aligned.fasta" shell: From 44825a3a700326a7dfd3b41a3241f60c3db6653c Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Fri, 12 Jan 2024 10:50:07 -0800 Subject: [PATCH 22/28] Copy contributing docs from mpox --- CONTRIBUTING.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) create mode 100644 CONTRIBUTING.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..a82417d --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,50 @@ +# Developer guide + +## CI + +Checks are automatically run on certain pushed commits for testing and linting +purposes. Some are defined by [.github/workflows/ci.yaml][] while others are +configured outside of this repository. + +[.github/workflows/ci.yaml]: ./.github/workflows/ci.yaml + +## Pre-commit + +[pre-commit][] is used for various checks (see [configuration][]). + +You can either [install it yourself][] to catch issues before pushing or look +for the [pre-commit.ci run][] after pushing. + +[pre-commit]: https://pre-commit.com/ +[configuration]: ./.pre-commit-config.yaml +[install it yourself]: https://pre-commit.com/#install +[pre-commit.ci run]: https://results.pre-commit.ci/repo/github/493877605 + +## Snakemake formatting + +We use [`snakefmt`](https://github.com/snakemake/snakefmt) to ensure consistency in style across Snakemake files in this project. 
+ +### Installing + +- Using mamba/bioconda: + +```bash +mamba install -c bioconda snakefmt +``` + +- Using pip: + +```bash +pip install snakefmt +``` + +### IDE-independent + +1. Check for styling issues with `snakefmt --check .` +1. Automatically fix styling issues with `snakefmt .` + +### Using VSCode extension + +1. Install the [VSCode extension](https://marketplace.visualstudio.com/items?itemName=tfehlmann.snakefmt) +1. Check for styling issues with `Ctrl+Shift+P` and select `snakefmt: Check` +1. Automatically fix styling issues with `Ctrl+Shift+P` and select `Format document` From aefdec195a40733e40e205a620e4537a0b3ccc64 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Fri, 12 Jan 2024 11:04:50 -0800 Subject: [PATCH 23/28] Simplify README instructions --- phylogenetic/README.md | 91 +++++++++++++----------------------------- 1 file changed, 28 insertions(+), 63 deletions(-) diff --git a/phylogenetic/README.md b/phylogenetic/README.md index 568bb03..5d831e7 100644 --- a/phylogenetic/README.md +++ b/phylogenetic/README.md @@ -3,42 +3,25 @@ This is the [Nextstrain](https://nextstrain.org) build for Zika, visible at [nextstrain.org/zika](https://nextstrain.org/zika). -The build encompasses fetching data, preparing it for analysis, doing quality -control, performing analyses, and saving the results in a format suitable for -visualization (with [auspice][]). This involves running components of -Nextstrain such as [fauna][] and [augur][]. +## Software requirements -All Zika-specific steps and functionality for the Nextstrain pipeline should be -housed in this repository. - -_This build requires Augur v6._ - -[![Build Status](https://github.com/nextstrain/zika/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/nextstrain/zika/actions/workflows/ci.yaml) +Follow the [standard installation instructions](https://docs.nextstrain.org/en/latest/install.html) for Nextstrain's suite of software tools. 
## Usage If you're unfamiliar with Nextstrain builds, you may want to follow our -[quickstart guide][] first and then come back here. +[Running a Pathogen Workflow guide][] first and then come back here. -There are two main ways to run & visualise the output from this build: +The easiest way to run this pathogen build is using the Nextstrain +command-line tool: -The first, and easiest, way to run this pathogen build is using the [Nextstrain -command-line tool][nextstrain-cli]: -``` -nextstrain build . -nextstrain view auspice/ -``` + nextstrain build . -See the [nextstrain-cli README][] for how to install the `nextstrain` command. +Build output goes into the directories `data/`, `results/` and `auspice/`. -The second is to install augur & auspice using conda, following [these instructions](https://nextstrain.org/docs/getting-started/local-installation#install-augur--auspice-with-conda-recommended). -The build may then be run via: -``` -snakemake -auspice --datasetDir auspice/ -``` +Once you've run the build, you can view the results in auspice: -Build output goes into the directories `data/`, `results/` and `auspice/`. + nextstrain view auspice/ ## Configuration @@ -46,43 +29,25 @@ Configuration takes place entirely with the `Snakefile`. This can be read top-to specifies its file inputs and output and also its parameters. There is little redirection and each rule should be able to be reasoned with on its own. +### Using GenBank data + +This build starts by pulling preprocessed sequence and metadata files from: + +* https://data.nextstrain.org/files/zika/sequences.fasta.zst +* https://data.nextstrain.org/files/zika/metadata.tsv.zst + +The above datasets have been preprocessed and cleaned from GenBank and are updated at regular intervals. + +### Using example data + +Alternatively, you can run the build using the +example data provided in this repository. 
To run the build by copying the +example sequences into the `data/` directory, use the following: -## Input data - -This build starts by downloading sequences from -https://data.nextstrain.org/files/zika/sequences.fasta.xz -and metadata from -https://data.nextstrain.org/files/zika/metadata.tsv.gz. -These are publicly provisioned data by the Nextstrain team by pulling sequences -from NCBI GenBank via ViPR and performing -[additional bespoke curation](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md). - -Data from GenBank follows Open Data principles, such that we can make input data -and intermediate files available for further analysis. Open Data is data that -can be freely used, re-used and redistributed by anyone - subject only, at most, -to the requirement to attribute and sharealike. - -We gratefully acknowledge the authors, originating and submitting laboratories -of the genetic sequences and metadata for sharing their work in open databases. -Please note that although data generators have generously shared data in an open -fashion, that does not mean there should be free license to publish on this -data. Data generators should be cited where possible and collaborations should -be sought in some circumstances. Please try to avoid scooping someone else's -work. Reach out if uncertain. Authors, paper references (where available) and -links to GenBank entries are provided in the metadata file. - -A faster build process can be run working from example data by copying over -sequences and metadata from `example_data/` to `data/` via: -``` -mkdir -p data/ -cp -v example_data/* data/ -``` + nextstrain build . 
--configfile profiles/ci/profiles_config.yaml [Nextstrain]: https://nextstrain.org -[fauna]: https://github.com/nextstrain/fauna -[augur]: https://github.com/nextstrain/augur -[auspice]: https://github.com/nextstrain/auspice -[snakemake cli]: https://snakemake.readthedocs.io/en/stable/executable.html#all-options -[nextstrain-cli]: https://github.com/nextstrain/cli -[nextstrain-cli README]: https://github.com/nextstrain/cli/blob/master/README.md -[quickstart guide]: https://nextstrain.org/docs/getting-started/quickstart +[augur]: https://docs.nextstrain.org/projects/augur/en/stable/ +[auspice]: https://docs.nextstrain.org/projects/auspice/en/stable/index.html +[Installing Nextstrain guide]: https://docs.nextstrain.org/en/latest/install.html +[Running a Pathogen Workflow guide]: https://docs.nextstrain.org/en/latest/tutorials/running-a-workflow.html From b734c786d7cb030aa22912cac0347876174b4d0c Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Thu, 18 Jan 2024 07:50:22 -0800 Subject: [PATCH 24/28] Refactor post-processing script to be specific to zika strain name fixes This commit refactors the generic post-processing script to better align with its specific purpose in Zika ingest. The purpose of this script is to fix zika strain names based on historical modifications from the fauna repo. 
In summary, the following changes were made: * Rename script to fix-zika-strain-names.py to match the purpose * Add a docstring to the script * Replace the accession argument with a strain field argument, which is the field that needs to be fixed --- ...post_process_metadata.py => fix-zika-strain-names.py} | 9 ++++----- ingest/rules/transform.smk | 3 +-- 2 files changed, 5 insertions(+), 7 deletions(-) rename ingest/bin/{post_process_metadata.py => fix-zika-strain-names.py} (85%) diff --git a/ingest/bin/post_process_metadata.py b/ingest/bin/fix-zika-strain-names.py similarity index 85% rename from ingest/bin/post_process_metadata.py rename to ingest/bin/fix-zika-strain-names.py index 3c587e5..6c46bf5 100755 --- a/ingest/bin/post_process_metadata.py +++ b/ingest/bin/fix-zika-strain-names.py @@ -8,10 +8,10 @@ def parse_args(): parser = argparse.ArgumentParser( - description="Reformat a NCBI Virus metadata.tsv file for a pathogen build." + description="Modify zika strain names by referencing historical modifications from the fauna repo."
) - parser.add_argument("--accession-field", default='accession', - help="Field from the records to use as the sequence ID in the FASTA file.") + parser.add_argument("--strain-field", default='strain', + help="Field from the records to use as the strain name to be fixed.") return parser.parse_args() @@ -48,8 +48,7 @@ def main(): for index, record in enumerate(stdin): record = json.loads(record) - record["strain"] = _set_strain_name(record) - record["authors"] = record["abbr_authors"] + record[args.strain_field] = _set_strain_name(record) stdout.write(json.dumps(record) + "\n") diff --git a/ingest/rules/transform.smk b/ingest/rules/transform.smk index a0891e5..cc4e917 100644 --- a/ingest/rules/transform.smk +++ b/ingest/rules/transform.smk @@ -85,8 +85,7 @@ rule transform: --abbr-authors-field {params.abbr_authors_field} \ | ./vendored/apply-geolocation-rules \ --geolocation-rules {input.all_geolocation_rules} \ - | ./bin/post_process_metadata.py \ - --accession-field {params.id_field} \ + | ./bin/fix-zika-strain-names.py \ | ./vendored/merge-user-metadata \ --annotations {input.annotations} \ --id-field {params.annotations_id} \ From a99af297b42642c9e1524561e8042c9c8063ee96 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Thu, 18 Jan 2024 08:00:13 -0800 Subject: [PATCH 25/28] Abbreviate authors in place instead of in a separate field --- ingest/rules/transform.smk | 1 - 1 file changed, 1 deletion(-) diff --git a/ingest/rules/transform.smk b/ingest/rules/transform.smk index cc4e917..2860fdd 100644 --- a/ingest/rules/transform.smk +++ b/ingest/rules/transform.smk @@ -82,7 +82,6 @@ rule transform: | ./vendored/transform-authors \ --authors-field {params.authors_field} \ --default-value {params.authors_default_value} \ - --abbr-authors-field {params.abbr_authors_field} \ | ./vendored/apply-geolocation-rules \ --geolocation-rules {input.all_geolocation_rules} \ | ./bin/fix-zika-strain-names.py \ From 230c6b023885b99d310625e888596ebcfd60975a Mon Sep 17 00:00:00 2001 From:
Jennifer Chang Date: Thu, 18 Jan 2024 08:05:46 -0800 Subject: [PATCH 26/28] As discussed, always use the default config --- ingest/Snakefile | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/ingest/Snakefile b/ingest/Snakefile index 453a327..fb26cc1 100644 --- a/ingest/Snakefile +++ b/ingest/Snakefile @@ -4,9 +4,8 @@ min_version( "7.7.0" ) # Snakemake 7.7.0 introduced `retries` directive used in fetch-sequences -if not config: - # Use default configuration values. Override with Snakemake's --configfile/--config options. - configfile: "config/defaults.yaml" +# Use default configuration values. Override with Snakemake's --configfile/--config options. +configfile: "config/defaults.yaml" send_slack_notifications = config.get("send_slack_notifications", False) From 8cf73262fe266286460c83ea851f9d7e2a9a2b46 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Thu, 18 Jan 2024 08:10:59 -0800 Subject: [PATCH 27/28] Update Contributing docs * Rephrase the CI section to be less likely to get out of sync * Drop the snakefmt section since we are not currently using it. snakefmt may be added back in the future. Co-authored-by: Jover Lee --- CONTRIBUTING.md | 45 +-------------------------------------------- 1 file changed, 1 insertion(+), 44 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a82417d..2ff4409 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -2,49 +2,6 @@ ## CI -Checks are automatically run on certain pushed commits for testing and linting -purposes. Some are defined by [.github/workflows/ci.yaml][] while others are -configured outside of this repository. +Tests are run through GitHub Actions when triggered by events as defined by [.github/workflows/ci.yaml][] [.github/workflows/ci.yaml]: ./.github/workflows/ci.yaml - -## Pre-commit - -[pre-commit][] is used for various checks (see [configuration][]).
- -You can either [install it yourself][] to catch issues before pushing or look -for the [pre-commit.ci run][] after pushing. - -[pre-commit]: https://pre-commit.com/ -[configuration]: ./.pre-commit-config.yaml -[install it yourself]: https://pre-commit.com/#install -[pre-commit.ci run]: https://results.pre-commit.ci/repo/github/493877605 - -## Snakemake formatting - -We use [`snakefmt`](https://github.com/snakemake/snakefmt) to ensure consistency in style across Snakemake files in this project. - -### Installing - -- Using mamba/bioconda: - -```bash -mamba install -c bioconda snakefmt -``` - -- Using pip: - -```bash -pip install snakefmt -``` - -### IDE-independent - -1. Check for styling issues with `snakefmt --check .` -1. Automatically fix styling issues with `snakefmt .` - -### Using VSCode extension - -1. Install the [VSCode extension](https://marketplace.visualstudio.com/items?itemName=tfehlmann.snakefmt) -1. Check for styling issues with `Ctrl+Shift+P` and select `snakefmt: Check` -1. Automatically fix styling issues with `Ctrl+Shift+P` and select `Format document` From 05606869b958bfa6cb341fa9efddf386f94c6557 Mon Sep 17 00:00:00 2001 From: Jennifer Chang Date: Thu, 18 Jan 2024 08:55:28 -0800 Subject: [PATCH 28/28] fixup: add docstring --- ingest/bin/fix-zika-strain-names.py | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/ingest/bin/fix-zika-strain-names.py b/ingest/bin/fix-zika-strain-names.py index 6c46bf5..5353ae8 100755 --- a/ingest/bin/fix-zika-strain-names.py +++ b/ingest/bin/fix-zika-strain-names.py @@ -1,4 +1,10 @@ #! /usr/bin/env python3 +""" +Parses GenBank's 'strain' field of the NDJSON record from stdin and applies Zika-specific strain name corrections +based on historical modifications from the fauna repo. + +Outputs the modified record to stdout. +""" import argparse import json
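The patches above only show the plumbing around `_set_strain_name` — the renamed script's NDJSON-filter shape (read one JSON record per line from stdin, rewrite a configurable strain field, write the record back to stdout) — but not the corrections themselves. As a rough sketch of that pattern, here is a minimal illustration; the `fix_strain_name` helper and its two replacement rules are hypothetical stand-ins, since the actual fauna-derived fixes are not part of these patches:

```python
import json
import re


def fix_strain_name(record, strain_field="strain"):
    """Return the record with a cleaned-up strain name.

    The two rules below are illustrative placeholders, not the real
    corrections applied by ingest/bin/fix-zika-strain-names.py.
    """
    strain = record.get(strain_field, "")
    # Drop a hypothetical redundant species prefix, e.g. "Zika virus/".
    strain = re.sub(r"^Zika[_ ]?virus/", "", strain)
    # Collapse whitespace runs into underscores.
    strain = re.sub(r"\s+", "_", strain.strip())
    record[strain_field] = strain
    return record


def process_ndjson(lines, strain_field="strain"):
    """Apply the fix to each NDJSON line, yielding NDJSON lines."""
    for line in lines:
        yield json.dumps(fix_strain_name(json.loads(line), strain_field))
```

In the `transform` rule's shell pipeline, the real script sits in the same position sketched here: between `apply-geolocation-rules` and `merge-user-metadata`, consuming and emitting one NDJSON record per line.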