Skip to content

Commit

Permalink
1.0.0
Browse files Browse the repository at this point in the history
  • Loading branch information
bede committed Jan 22, 2024
1 parent 6b151ed commit 247ae42
Show file tree
Hide file tree
Showing 6 changed files with 72 additions and 41 deletions.
59 changes: 33 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,22 +6,25 @@

# Hostile

Hostile accurately removes host sequences from short and long reads, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – a human reference genome is downloaded when run for the first time. Hostile is precise by default, removing an [order of magnitude fewer microbial reads](https://log.bede.im/2023/08/29/precise-host-read-removal.html#evaluating-accuracy) than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For the best possible retention of microbial reads, one can use a reference masked against bacterial and/or viral genomes, or make your own using the built-in masking utility. Read headers can be replaced with integers (using `--rename`) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads while Minimap2 is default aligner for long reads. In benchmarks, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s, using 8 alignment threads. Further information and benchmarks can be found in the [paper](https://doi.org/10.1093/bioinformatics/btad728) and [blog post](https://log.bede.im/2023/08/29/precise-host-read-removal.html). Please open an issue to report problems [or](mailto:[email protected]) [otherwise](https://twitter.com/beconsta) [reach](https://bsky.app/profile/bedec.bsky.social) [out](https://mstdn.science/@bede) for help, advice etc.
Hostile accurately removes host sequences from short and long read (meta)genomes, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – a human reference genome is downloaded when run for the first time. Hostile is precise by default, removing an [order of magnitude fewer microbial reads](https://log.bede.im/2023/08/29/precise-host-read-removal.html#evaluating-accuracy) than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For the best possible retention of microbial reads, use an existing index masked against bacterial and/or viral genomes, or make your own using the built-in masking utility. Read headers can be replaced with integers (using `--rename`) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads while Minimap2 is default aligner for long reads. In benchmarks, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s, using 8 alignment threads. Further information and benchmarks can be found in the [paper](https://doi.org/10.1093/bioinformatics/btad728) and [blog post](https://log.bede.im/2023/08/29/precise-host-read-removal.html). Please open an issue to report problems [or](mailto:[email protected]) [otherwise](https://twitter.com/beconsta) [reach](https://bsky.app/profile/bedec.bsky.social) [out](https://mstdn.science/@bede) for help, advice etc.



## Reference genomes (indexes)

The default index `human-t2t-hla` comprises [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) and [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51, and is downloaded automatically when running Hostile unless another index is specified. Slightly higher microbial sequence retention is possible using a masked index, of which several are available. The index `human-t2t-hla-argos985` is masked against [985 reference grade bacterial genomes](https://www.ncbi.nlm.nih.gov/bioproject/231221) including common human pathogens, while `human-t2t-hla-argos985-mycob140` is further masked against mycobacterial genomes. To use a standard index, simply pass its name as the value of the `--index` argument which takes care of downloading and cacheing the relevant index. Automatic download can be disabled using the `--offline` flag, and `--index` can accept a path to a custom reference genome or Bowtie2 index. [Object storage](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o) is provided by the [ModMedMicro research group](https://www.expmedndm.ox.ac.uk/modernising-medical-microbiology).

| Name | Composition | Date | Masked positions |
| :----------------------------------------------------------: | :----------------------------------------------------------: | ------- | --------------------- |
| `human-t2t-hla` **(default)** | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | 2023-07 | 0 (**0%**) |
| `human-t2t-hla-argos985` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial genomes | 2023-07 | 317,973 (**0.01%**) |
| `human-t2t-hla-argos985-mycob140` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) mycobacterial genomes | 2023-07 | 319,752 (**0.01%**) |
| `human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial, 18,719 RefSeq viral, and 26,928 [Millard Lab phage](https://millardlab.org/phage-genomes-jan-2024/) genomes | 2024-01 | 1,473,260 (**0.04%**) |
| Name | Composition | Date | Masked positions |
| :----------------------------------------------------------: | :----------------------------------------------------------: | ------- | ---------------------- |
| `human-t2t-hla` **(default)** | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | 2023-07 | 0 (**0%**) |
| `human-t2t-hla-argos985` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) **bacterial** genomes | 2023-07 | 317,973 (**0.010%**) |
| `human-t2t-hla.rs-viral-202401_ml-phage-202401` | `human-t2t-hla` masked with 150mers for 18,719 RefSeq **viral** and 26,928 [Millard Lab **phage**](https://millardlab.org/phage-genomes-jan-2024/) genomes | 2024-01 | 1,172,993 (**0.037%**) |
| `human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial, 18,719 RefSeq **viral**, and 26,928 [Millard Lab **phage**](https://millardlab.org/phage-genomes-jan-2024/) genomes | 2024-01 | 1,473,260 (**0.046%**) |
| `human-t2t-hla-argos985-mycob140` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) **bacterial** & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) **mycobacterial** genomes | 2023-07 | 319,752 (**0.010%**) |

*Performance of `human-t2t-hla` and `human-t2t-hla-argos985-mycob140` was evaluated in the [paper](https://doi.org/10.1093/bioinformatics/btad728)*


*Performance of `human-t2t-hla` and `human-t2t-hla-argos985-mycob140` was evaluated in the [paper](https://www.biorxiv.org/content/10.1101/2023.07.04.547735)*

## Install [![Install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square&logo=anaconda)](https://biocontainers.pro/tools/hostile) [![Install with Docker](https://img.shields.io/badge/install%20with-docker-important.svg?style=flat-square&logo=docker)](https://biocontainers.pro/tools/hostile)

Expand All @@ -44,27 +47,16 @@ wget https://raw.githubusercontent.com/bede/hostile/main/Dockerfile
docker build . --platform linux/amd64
```

**Development install**

```bash
git clone https://github.com/bede/hostile.git
cd hostile
conda env create -y -f environment.yml
conda activate hostile
pip install --editable '.[dev]'
pytest
```



## Index installation
## Index installation (optional)

This is optional since Hostile automatically downloads and caches the default index `human-t2t-hla` when run for the first time.
Hostile automatically downloads and caches the default index `human-t2t-hla` when run for the first time, meaning that there is no need to download an index in advance. Neverthless:

- To force Hostile to download the default index (`human-t2t-hla`), run `hostile fetch`
- To download and cache the default index (`human-t2t-hla`), run `hostile fetch`
- To list available indexes, run `hostile fetch --list`
- To download another standard index, run e.g. `hostile fetch --name human-t2t-hla-argos985`
- To use a custom genome (made perhaps with `hostile mask`?), run `hostile clean --index path/to/genome.fa`
- To download and cache another standard index, run e.g. `hostile fetch --name human-t2t-hla-argos985`
- To use a custom genome (made perhaps with `hostile mask`), run `hostile clean --index path/to/genome.fa`



Expand Down Expand Up @@ -228,16 +220,31 @@ You may wish to use one of the existing [reference genomes](#reference-genomes--
Bede Constantinides, Martin Hunt, Derrick W Crook, Hostile: accurate decontamination of microbial host sequences, *Bioinformatics*, 2023; btad728, https://doi.org/10.1093/bioinformatics/btad728
```latex
@article{Constantinides2023,
@article{10.1093/bioinformatics/btad728,
author = {Constantinides, Bede and Hunt, Martin and Crook, Derrick W},
title = {Hostile: accurate decontamination of microbial host sequences},
journal = {Bioinformatics},
volume = {39},
number = {12},
pages = {btad728},
year = {2023},
month = {12},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btad728},
url = {https://doi.org/10.1093/bioinformatics/btad728},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad728/53970806/btad728.pdf},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/12/btad728/54850422/btad728.pdf},
}
```
## Development install
```bash
git clone https://github.com/bede/hostile.git
cd hostile
conda env create -y -f environment.yml
conda activate hostile
pip install --editable '.[dev]'
pytest
```
14 changes: 12 additions & 2 deletions bucket/manifest.json
Original file line number Diff line number Diff line change
Expand Up @@ -29,13 +29,23 @@
}
}
},
"human-t2t-hla.rs-viral-202401_ml-phage-202401": {
"assets": {
"human-t2t-hla.rs-viral-202401_ml-phage-202401.fa.gz": {
"sha256": "3f95be8ffd99a0ebc7667af2dc699ac7921bd474b5edf200fe319b13f98b0265"
},
"human-t2t-hla.rs-viral-202401_ml-phage-202401.tar": {
"sha256": "d1d5386e65f2e9006c09cfd3f983499e07e968ce3a6332b5d546bbd09088af3e"
}
}
},
"human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401": {
"assets": {
"human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401.fa.gz": {
"sha256": ""
"sha256": "a9693618eaf93fae24e07c184019ca96f81cd1ec098858f3f4c901c706efa8f0"
},
"human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401.tar": {
"sha256": ""
"sha256": "681819bec809f2e3b613d449492954baa018d032879e3acfbcc78c5568009410"
}
}
}
Expand Down
8 changes: 2 additions & 6 deletions src/hostile/aligner.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,6 @@ def check_index(self, index: str, offline: bool = False) -> Path:
manifest = util.fetch_manifest(util.BUCKET_URL)
with tempfile.NamedTemporaryFile() as temporary_file:
tmp_path = Path(temporary_file.name)
# util.download(f"http://localhost:8000/{file_name}", tmp_path)
util.download(f"{file_url}", tmp_path)
expected_sha256 = manifest[index]["assets"][file_name]["sha256"]
logging.info(f"Verifying checksum {expected_sha256}…")
Expand Down Expand Up @@ -74,7 +73,6 @@ def check_index(self, index: str, offline: bool = False) -> Path:
manifest = util.fetch_manifest(util.BUCKET_URL)
with tempfile.NamedTemporaryFile() as temporary_file:
tmp_path = Path(temporary_file.name)
# util.download(f"http://localhost:8000/{file_name}", tmp_path)
util.download(f"{file_url}", tmp_path)
expected_sha256 = manifest[index]["assets"][file_name]["sha256"]
logging.info(f"Verifying checksum {expected_sha256}…")
Expand All @@ -97,7 +95,7 @@ def gen_clean_cmd(
self,
fastq: Path,
out_dir: Path,
index: str,
index_path: Path,
invert: bool,
rename: bool,
reorder: bool,
Expand All @@ -116,7 +114,6 @@ def gen_clean_cmd(
raise FileExistsError(
f"Output file already exists. Use --force to overwrite"
)
index_path = self.check_index(index, offline=offline)
filter_cmd = " | samtools view -hF 4 -" if invert else " | samtools view -f 4 -"
reorder_cmd = " | samtools sort -n -O sam -@ 6 -m 1G" if reorder else ""
rename_cmd = (
Expand Down Expand Up @@ -159,7 +156,7 @@ def gen_paired_clean_cmd(
fastq1: Path,
fastq2: Path,
out_dir: Path,
index: Path | None,
index_path: Path,
invert: bool,
rename: bool,
reorder: bool,
Expand All @@ -180,7 +177,6 @@ def gen_paired_clean_cmd(
raise FileExistsError(
f"Output files already exist. Use --force to overwrite"
)
index_path = self.check_index(index, offline=offline)
filter_cmd = (
" | samtools view -hF 12 -" if invert else " | samtools view -f 12 -"
)
Expand Down
4 changes: 1 addition & 3 deletions src/hostile/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,9 +129,7 @@ def fetch(
list: bool = False,
) -> None:
"""
Download indexes from object storage for use with hostile clean. For Minimap2 a
compressed genome is downloaded, while for Bowtie2 a tarred index is downloaded
and extracted
Download and cache indexes from object storage for use with hostile clean
:arg filename: filename of index to download
:arg aligner: aligner(s) for which to download an index
Expand Down
10 changes: 6 additions & 4 deletions src/hostile/lib.py
Original file line number Diff line number Diff line change
Expand Up @@ -175,11 +175,12 @@ def clean_fastqs(
if not all(fastq.is_file() for fastq in fastqs):
raise FileNotFoundError("One or more fastq files do not exist")
Path(out_dir).mkdir(exist_ok=True, parents=True)
index_path = aligner.value.check_index(index, offline=offline)
backend_cmds = [
aligner.value.gen_clean_cmd(
fastq=fastq,
out_dir=out_dir,
index=index,
index_path=index_path,
invert=invert,
rename=rename,
reorder=reorder,
Expand Down Expand Up @@ -233,12 +234,13 @@ def clean_paired_fastqs(
if not all(path.is_file() for fastq_pair in fastqs for path in fastq_pair):
raise FileNotFoundError("One or more fastq files do not exist")
Path(out_dir).mkdir(exist_ok=True, parents=True)
index_path = aligner.value.check_index(index, offline=offline)
backend_cmds = [
aligner.value.gen_paired_clean_cmd(
fastq1=pair[0],
fastq2=pair[1],
out_dir=out_dir,
index=index,
index_path=index_path,
invert=invert,
rename=rename,
reorder=reorder,
Expand Down Expand Up @@ -279,8 +281,8 @@ def mask(
out_dir.mkdir(exist_ok=True, parents=True)
kmers_path = out_dir / "kmers.fasta.gz"
mask_path = out_dir / "mask.bed"
masked_ref_path = out_dir / "masked.fa"
masked_ref_index_path = out_dir / "masked"
masked_ref_path = out_dir / f"{out_dir.name}.fa"
masked_ref_index_path = out_dir / out_dir.name

if ref_path.suffix == ".gz": # Decompress reference if necessary
new_ref_path = out_dir / ref_path.stem
Expand Down
18 changes: 18 additions & 0 deletions tests/test_all.py
Original file line number Diff line number Diff line change
Expand Up @@ -584,6 +584,24 @@ def test_invert_paired():
shutil.rmtree(out_dir, ignore_errors=True)


def test_invert_paired_2():
stats = lib.clean_paired_fastqs(
fastqs=[
(
data_dir / "sars-cov-2_100_1.fastq.gz",
data_dir / "sars-cov-2_100_2.fastq.gz",
),
],
aligner=lib.ALIGNER.bowtie2,
index=data_dir / "sars-cov-2/sars-cov-2",
out_dir=out_dir,
invert=True,
force=True,
)
assert stats[0]["reads_out"] == 92
shutil.rmtree(out_dir, ignore_errors=True)


def test_minimap2_reordering_linux():
stats = lib.clean_paired_fastqs(
fastqs=[
Expand Down

0 comments on commit 247ae42

Please sign in to comment.