update docs with examples
meren committed Aug 16, 2024
1 parent 6a6d0ce commit a6e4a0d
Showing 1 changed file with 203 additions and 0 deletions.
203 changes: 203 additions & 0 deletions anvio/docs/programs/anvi-get-sequences-for-hmm-hits.md
Expand Up @@ -81,6 +81,209 @@ anvi-get-sequences-for-hmm-hits -c %(contigs-db)s \
-o %(genes-fasta)s
{{ codestop }}

### Change the formatting of the output FASTA files

This program lets you customize the deflines of the resulting FASTA file through a user-defined template built from a set of predefined variables. To see which variables are available, include the following flag in your command:

{{ codestart }}
anvi-get-sequences-for-hmm-hits -c %(contigs-db)s \
--list-defline-variables
{{ codestop }}

This will produce output similar to the following:

```
WARNING
===============================================
Here are the variables you can use to provide a user-defined defline template:
* {gene_name}
* {gene_callers_id}
* {contig_name}
* {gene_unique_id}
* {bin_name}
* {source}
* {e_value}
* {start}
* {stop}
* {length}
Remember, by default, anvi'o will use the following template to format the
deflines of FASTA files it produces whenever possible.
{gene_name}___{gene_unique_id} bin_id:{bin_name}|source:{source}|e_value:{e_value}|contig:{contig_name}|gene_callers_id:{gene_callers_id}|start:{start}|stop:{stop}|length:{length}
```

With this default template, %(anvi-get-sequences-for-hmm-hits)s produces a FASTA file with quite comprehensive deflines. The examples below use the data pack for the Infant Gut Dataset, after running this command in the data directory:

{{ codestart }}
%(anvi-import-collection)s additional-files/collections/merens.txt \
-p PROFILE.db \
-c CONTIGS.db \
-C merens
{{ codestop }}

Here are a few commands that show how different deflines impact the FASTA output, starting with the default defline format:

```
anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-o OUTPUT.fa
```

```
>RsfS___Bacteria_71___a2cb9835d40f1ea052bc7aea2ef8c12924c76f4fa00e00b69f7dbecb bin_id:CONTIGS|source:Bacteria_71|e_value:2.6e-14|contig:Day17a_QCcontig1|gene_callers_id:22|start:17642|stop:17792|length:150
>SecE___Bacteria_71___fe4bf2883dfee62dcf2ed93acde8df5de5be421675d7231997dc5f91 bin_id:CONTIGS|source:Bacteria_71|e_value:3.6e-22|contig:Day17a_QCcontig1|gene_callers_id:105|start:95904|stop:96075|length:171
>Ribosomal_L1___Bacteria_71___9a3cc2232443c0c0e995fef5f76f34d52fb111476e0e2fa2dbe0f09e bin_id:CONTIGS|source:Bacteria_71|e_value:8.4e-59|contig:Day17a_QCcontig1|gene_callers_id:116|start:106378|stop:107068|length:690
>SecG___Bacteria_71___d23b8716a0b03543405ac9835ce127b981fc1c067d9b1e48edf9d861 bin_id:CONTIGS|source:Bacteria_71|e_value:1e-20|contig:Day17a_QCcontig1|gene_callers_id:205|start:200881|stop:201118|length:237
>SmpB___Bacteria_71___a95dd711a5d10cfb4bd9ca3d836aed077d35647161d4908a9c8252fe bin_id:CONTIGS|source:Bacteria_71|e_value:2.8e-66|contig:Day17a_QCcontig1|gene_callers_id:208|start:204418|stop:204883|length:465
```
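
Since the default defline stores each attribute as a `key:value` pair separated by `|`, the attributes are easy to recover in downstream scripts. The sketch below is not part of anvi'o: `parse_default_defline` is a hypothetical helper that assumes the default template shown above, and that no value contains a space or a `|` character.

```python
def parse_default_defline(defline):
    """Split a default-format defline into its ID and a dict of its fields."""
    # Drop the leading '>' and separate the ID from the 'key:value|key:value' part.
    header_id, _, field_text = defline.lstrip(">").strip().partition(" ")
    fields = {}
    for item in field_text.split("|"):
        if not item:
            continue
        # Split on the first ':' only, so values that contain ':' stay intact.
        key, _, value = item.partition(":")
        fields[key] = value
    return header_id, fields


# First defline from the example output above.
defline = (">RsfS___Bacteria_71___a2cb9835d40f1ea052bc7aea2ef8c12924c76f4fa00e00b69f7dbecb "
           "bin_id:CONTIGS|source:Bacteria_71|e_value:2.6e-14|contig:Day17a_QCcontig1|"
           "gene_callers_id:22|start:17642|stop:17792|length:150")

header_id, fields = parse_default_defline(defline)
print(fields["contig"], fields["start"], fields["stop"])
```

All values come back as strings, so convert `start`, `stop`, and `length` with `int()` if you need to do arithmetic on them.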

---

```
anvi-get-sequences-for-hmm-hits -c CONTIGS.db \
-o OUTPUT.fa \
-p PROFILE.db \
-C merens
```

```
>RsfS___Bacteria_71___a2cb9835d40f1ea052bc7aea2ef8c12924c76f4fa00e00b69f7dbecb bin_id:E_facealis|source:Bacteria_71|e_value:2.6e-14|contig:Day17a_QCcontig1|gene_callers_id:22|start:17642|stop:17792|length:150
>SecE___Bacteria_71___fe4bf2883dfee62dcf2ed93acde8df5de5be421675d7231997dc5f91 bin_id:E_facealis|source:Bacteria_71|e_value:3.6e-22|contig:Day17a_QCcontig1|gene_callers_id:105|start:95904|stop:96075|length:171
>Ribosomal_L1___Bacteria_71___9a3cc2232443c0c0e995fef5f76f34d52fb111476e0e2fa2dbe0f09e bin_id:E_facealis|source:Bacteria_71|e_value:8.4e-59|contig:Day17a_QCcontig1|gene_callers_id:116|start:106378|stop:107068|length:690
>SecG___Bacteria_71___d23b8716a0b03543405ac9835ce127b981fc1c067d9b1e48edf9d861 bin_id:E_facealis|source:Bacteria_71|e_value:1e-20|contig:Day17a_QCcontig1|gene_callers_id:205|start:200881|stop:201118|length:237
>SmpB___Bacteria_71___a95dd711a5d10cfb4bd9ca3d836aed077d35647161d4908a9c8252fe bin_id:E_facealis|source:Bacteria_71|e_value:2.8e-66|contig:Day17a_QCcontig1|gene_callers_id:208|start:204418|stop:204883|length:465
```

---

```
anvi-get-sequences-for-hmm-hits -p PROFILE.db \
-c CONTIGS.db \
-C merens \
-o OUTPUT.fa \
--hmm-source Bacteria_71 \
--gene-names Ribosomal_L27,Ribosomal_L28,Ribosomal_L3
```

```
>Ribosomal_L27___Bacteria_71___477a28bece0b76ebaef1a9415cbd15b422ec581619e437a79f227f2f bin_id:E_facealis|source:Bacteria_71|e_value:4.1e-37|contig:Day17a_QCcontig2|gene_callers_id:1130|start:86504|stop:86792|length:288
>Ribosomal_L3___Bacteria_71___611453ec6ee187ebae05815acd0d20efa3be77b791a69055aa9ae776 bin_id:S_epidermidis|source:Bacteria_71|e_value:1.4e-19|contig:Day17a_QCcontig7|gene_callers_id:2339|start:19929|stop:20592|length:663
>Ribosomal_L3___Bacteria_71___3e44e2b950f135b147b3a5ca7813581228c3d91c7d2cb69c8543237d bin_id:E_facealis|source:Bacteria_71|e_value:1.5e-20|contig:Day17a_QCcontig16|gene_callers_id:3080|start:229935|stop:230565|length:630
>Ribosomal_L3___Bacteria_71___0d18b80b3c3eb77af3683e64fa4278ac08929c53c8a4593c205b8849 bin_id:P_rhinitidis|source:Bacteria_71|e_value:9.1e-17|contig:Day17a_QCcontig21|gene_callers_id:3313|start:54724|stop:55357|length:633
>Ribosomal_L28___Bacteria_71___e36828b18440c369bcf880a267cea6d55fa65cf1b0b1d66c8784901a bin_id:E_facealis|source:Bacteria_71|e_value:2.2e-24|contig:Day17a_QCcontig23|gene_callers_id:3633|start:124223|stop:124412|length:189
>Ribosomal_L27___Bacteria_71___bd93789ee7b2f24fefa4ddfc0949be06573a6f6f3cdc85a6ada54323 bin_id:S_epidermidis|source:Bacteria_71|e_value:1e-36|contig:Day17a_QCcontig29|gene_callers_id:4151|start:9097|stop:9382|length:285
>Ribosomal_L27___Bacteria_71___2b6e4f252a14a16375d107711cde0363f266f1981fa353a2421fd989 bin_id:S_aureus|source:Bacteria_71|e_value:7.2e-37|contig:Day17a_QCcontig56|gene_callers_id:6123|start:46552|stop:46837|length:285
>Ribosomal_L27___Bacteria_71___51f9c99d5530e7342657805b778105e8e75c178b1780140163e528e3 bin_id:P_rhinitidis|source:Bacteria_71|e_value:3.8e-36|contig:Day17a_QCcontig58|gene_callers_id:6253|start:92104|stop:92401|length:297
>Ribosomal_L28___Bacteria_71___4be054e14a9cb7f80dea306fbbcb391a195acda0a705853b6b010a2e bin_id:P_avidum|source:Bacteria_71|e_value:1.7e-26|contig:Day17a_QCcontig60|gene_callers_id:6345|start:76696|stop:76933|length:237
```

---

```
anvi-get-sequences-for-hmm-hits -p PROFILE.db \
-c CONTIGS.db \
-C merens \
-o OUTPUT.fa \
--gene-names Ribosomal_L27,Ribosomal_L28,Ribosomal_L3 \
--defline-format "{"
```

```
Init .........................................: 4451 splits in 13 bin(s)
Config Error: Your f-string syntax is not working for anvi'o :/ Perhaps you forgot to open or
close a curly bracket?
```

---

```
anvi-get-sequences-for-hmm-hits -p PROFILE.db \
-c CONTIGS.db \
-C merens \
-o OUTPUT.fa \
--gene-names Ribosomal_L27,Ribosomal_L28,Ribosomal_L3 \
--defline-format "{lol}"
```

```
Config Error: Some of the variables in your f-string does not occur in the source dictionary
:/ Here is the list of those that are not matching to anything: lol. In the
meantime, these are the known keys: gene_name, gene_callers_id, contig_name,
gene_unique_id, bin_name, source, e_value, start, stop, length.
```
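
If you generate defline templates programmatically, you may want to catch both kinds of mistakes before calling the program. The following is a minimal sketch using Python's standard `string.Formatter`. It is not how anvi'o itself validates templates, and `check_defline_template` is a hypothetical helper. The set of known variables is copied from the error message above.

```python
from string import Formatter

# Variable names as listed in the anvi'o error message above.
KNOWN_VARIABLES = {"gene_name", "gene_callers_id", "contig_name", "gene_unique_id",
                   "bin_name", "source", "e_value", "start", "stop", "length"}


def check_defline_template(template):
    """Return a list of problems found in a defline template (empty if none)."""
    try:
        # Formatter.parse() yields (literal_text, field_name, format_spec, conversion).
        used = {name for _, name, _, _ in Formatter().parse(template) if name}
    except ValueError as error:
        # Raised for unbalanced curly brackets such as a lone "{".
        return [f"malformed template: {error}"]
    unknown = sorted(used - KNOWN_VARIABLES)
    if unknown:
        return ["unknown variables: " + ", ".join(unknown)]
    return []


print(check_defline_template("{bin_name}_{gene_callers_id}"))
print(check_defline_template("{lol}"))
print(check_defline_template("{"))
```

An empty list means the template only uses known variables and has balanced brackets.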

---

```
anvi-get-sequences-for-hmm-hits -p PROFILE.db \
-c CONTIGS.db \
-C merens \
-o OUTPUT.fa \
--hmm-source Bacteria_71 \
--gene-names Ribosomal_L27,Ribosomal_L28,Ribosomal_L3 \
--return-best-hit \
--defline-format "{bin_name}_{gene_callers_id}"
```

```
>E_facealis_1130
>S_epidermidis_2339
>E_facealis_3080
>P_rhinitidis_3313
>E_facealis_3633
>S_epidermidis_4151
>S_aureus_6123
>P_rhinitidis_6253
```

---

```
anvi-get-sequences-for-hmm-hits -p PROFILE.db \
-c CONTIGS.db \
-C merens \
-o OUTPUT.fa \
--hmm-source Bacteria_71 \
--gene-names Ribosomal_L27,Ribosomal_L28,Ribosomal_L3 \
--return-best-hit \
--defline-format "{bin_name}_{source}_{gene_callers_id}"
```

```
>E_facealis_Bacteria_71_1130
>S_epidermidis_Bacteria_71_2339
>E_facealis_Bacteria_71_3080
>P_rhinitidis_Bacteria_71_3313
```

---

```
anvi-get-sequences-for-hmm-hits -p PROFILE.db \
-c CONTIGS.db \
-C merens \
-o OUTPUT.fa \
--hmm-source Bacteria_71 \
--gene-names Ribosomal_L27,Ribosomal_L28,Ribosomal_L3 \
--return-best-hit \
--defline-format "{gene_name}_{gene_callers_id} source:{source}|contig:{contig_name}|start:{start}|stop:{stop}"
```

```
>Ribosomal_L27_1130 source:Bacteria_71|contig:Day17a_QCcontig2|start:86504|stop:86792
>Ribosomal_L3_2339 source:Bacteria_71|contig:Day17a_QCcontig7|start:19929|stop:20592
>Ribosomal_L3_3080 source:Bacteria_71|contig:Day17a_QCcontig16|start:229935|stop:230565
>Ribosomal_L3_3313 source:Bacteria_71|contig:Day17a_QCcontig21|start:54724|stop:55357
```
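
To make the substitution concrete, here is a small illustration of how such a template maps to a defline, using plain Python `str.format`. This is only a sketch of the idea, not anvi'o's internal code. The values in `hit` are taken from the first entry in the output above.

```python
# The same template passed to --defline-format in the command above.
template = ("{gene_name}_{gene_callers_id} "
            "source:{source}|contig:{contig_name}|start:{start}|stop:{stop}")

# Values for a single HMM hit, copied from the first defline of the output above.
hit = {
    "gene_name": "Ribosomal_L27",
    "gene_callers_id": 1130,
    "source": "Bacteria_71",
    "contig_name": "Day17a_QCcontig2",
    "start": 86504,
    "stop": 86792,
}

# Each {variable} in the template is replaced with the matching value.
defline = ">" + template.format(**hit)
print(defline)
```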

---

Please note that anvi'o will not check whether your defline format will result in FASTA entries with identical deflines.
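
If your template might not be unique per sequence (for instance, one that only contains `{bin_name}` and `{source}`), you can check the output file yourself. Below is a minimal sketch in plain Python. `find_duplicate_deflines` is a hypothetical helper, not an anvi'o function, and it compares entire defline lines.

```python
from collections import Counter


def find_duplicate_deflines(fasta_path):
    """Return a dict mapping each repeated defline to its number of occurrences."""
    with open(fasta_path) as fasta:
        # Only lines that start with '>' are deflines; sequence lines are skipped.
        counts = Counter(line.rstrip("\n") for line in fasta if line.startswith(">"))
    return {defline: n for defline, n in counts.items() if n > 1}


if __name__ == "__main__":
    import sys

    # Usage: python check_deflines.py OUTPUT.fa
    if len(sys.argv) > 1:
        for defline, n in find_duplicate_deflines(sys.argv[1]).items():
            print(f"{defline} occurs {n} times")
```

If the script prints nothing, every defline in the file is unique.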


### Get HMM hits independently aligned and concatenated

The resulting file can be used for phylogenomics analyses via %(anvi-gen-phylogenomic-tree)s or through more sophisticated tools for curating alignments and computing trees.