---
title: "Command Line Code"
author: "Emily Van Buren"
date: "`r Sys.Date()`"
output:
html_document: default
word_document: default
pdf_document: default
---
# FastP: Data Pre-processing
Each sample in the study is first put through a pre-processing step in which adapters are trimmed and quality filtering is performed using [FastP](https://github.com/OpenGene/fastp) (version 0.21.0), an all-in-one pre-processing program for FastQ files with multi-threading support for high performance. The code in fastp.sh is as follows:
```{linux, eval=FALSE}
#!/bin/bash
#SERVER=coraimmunity
PATH=$PATH:/opt/storage/opt_programs/fastp
for FILE in *_1.fq.gz; do
echo ${FILE}
SAMP=$(basename -s _1.fq.gz $FILE)
echo $SAMP
nohup /opt/storage/opt_programs/fastp -i ${SAMP}_1.fq.gz -I ${SAMP}_2.fq.gz -o fastp/${SAMP}_fp_1.fq.gz -O fastp/${SAMP}_fp_2.fq.gz
done
```
# De novo Transcriptome
De novo transcriptomes were built on two different systems: the HPC cluster at the University of Texas at Arlington (UTA) and the Frontera supercomputer at the Texas Advanced Computing Center (TACC). Transcriptome assembly was primarily run on Frontera, and the remaining steps were run on the UTA HPC. Each step notes which system was used.
## Step 1: De novo Transcriptome Assembly
### M.cavernosa
Four WP samples (2 controls and 2 diseased) and four SCTLD samples (2 controls and 2 diseased) are randomly selected and their FastP adapter-trimmed and quality-filtered reads are used to generate a de novo reference transcriptome using [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki). This job was run on the Frontera supercomputer at the Texas Advanced Computing Center (TACC).
Samples used in assembly:
UVI_083 (WP control)
UVI_087 (WP control)
M_c3 (SCTLD control)
M_c7 (SCTLD control)
UVI_082 (WP disease)
UVI_084 (WP disease)
M_d5 (SCTLD disease)
M_d1 (SCTLD disease)
The code in mcav_denovo_trinity.sh is as follows:
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J mcav_denovo_trinity # job name
#SBATCH -o mcav_denovo_trinity.o%j # output and error file name (%j expands to jobID)
#SBATCH -e mcav_denovo_trinity.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 32 # total number of mpi tasks requested
#SBATCH -p nvdimm # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH -A DEB22003 # charge job to the specified project
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
module load tacc-singularity
PATH=$PATH:/scratch1/07127/jatbee32/WP_experiment/mcav/trinity
singularity exec -e /scratch1/07127/jatbee32/trinityrnaseq.v2.15.1.simg Trinity --normalize_reads --seqType fq --max_memory 200G --CPU 20 --SS_lib_type FR --left UVI_083_fp_1.fq.gz,UVI_087_fp_1.fq.gz,UVI_082_fp_1.fq.gz,UVI_084_fp_1.fq.gz,M_d5_1.fq.gz,M_d1_1.fq.gz,M_c3_1.fq.gz,M_c7_1.fq.gz --right UVI_083_fp_2.fq.gz,UVI_087_fp_2.fq.gz,UVI_082_fp_2.fq.gz,UVI_084_fp_2.fq.gz,M_d5_2.fq.gz,M_d1_2.fq.gz,M_c3_2.fq.gz,M_c7_2.fq.gz
```
### C.natans
Three WP samples (1 control and 2 diseased) and four SCTLD samples (2 controls and 2 diseased) are randomly selected and their FastP adapter-trimmed and quality-filtered reads are used to generate a de novo reference transcriptome using [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki). This job was run on the Frontera supercomputer at the Texas Advanced Computing Center (TACC).
Samples used in assembly:
UVI_071 (WP control)
UVI_076 (WP disease)
UVI_080 (WP disease - exposed?)
C_c4 (SCTLD control)
C_c6 (SCTLD control)
C_d3 (SCTLD disease)
C_d8 (SCTLD disease)
The code in cnat_denovo_trinity.sh is as follows:
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J cnat_denovo_trinity # job name
#SBATCH -o cnat_denovo_trinity.o%j # output and error file name (%j expands to jobID)
#SBATCH -e cnat_denovo_trinity.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 32 # total number of mpi tasks requested
#SBATCH -p nvdimm # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH -A DEB22003 # charge job to the specified project
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
module load tacc-singularity
PATH=$PATH:/scratch1/07127/jatbee32/WP_experiment/cnat/trinity
singularity exec -e /scratch1/07127/jatbee32/trinityrnaseq.v2.15.1.simg Trinity --normalize_reads --seqType fq --CPU 20 --max_memory 200G --SS_lib_type FR --left UVI_71_fp_1.fq.gz,UVI_76_fp_1.fq.gz,UVI_80_fp_1.fq.gz,C_c4_1.fq.gz,C_c6_1.fq.gz,C_d3_1.fq.gz,C_d8_1.fq.gz --right UVI_71_fp_2.fq.gz,UVI_76_fp_2.fq.gz,UVI_80_fp_2.fq.gz,C_c4_2.fq.gz,C_c6_2.fq.gz,C_d3_2.fq.gz,C_d8_2.fq.gz
```
### O.faveolata
Four samples (2 controls and 2 diseased) are randomly selected and their FastP adapter-trimmed and quality-filtered reads are used to generate a de novo reference transcriptome using [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki). This job was run on the Frontera supercomputer at the Texas Advanced Computing Center (TACC).
Samples used in assembly:
UVI_099 (control)
UVI_103 (control)
UVI_106 (disease)
UVI_108 (disease)
The code in ofav_denovo_trinity.sh is as follows:
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J ofav_denovo_trinity # job name
#SBATCH -o ofav_denovo_trinity.o%j # output and error file name (%j expands to jobID)
#SBATCH -e ofav_denovo_trinity.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 32 # total number of mpi tasks requested
#SBATCH -p nvdimm # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH -A DEB22003 # charge job to the specified project
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
module load tacc-singularity
PATH=$PATH:/scratch1/07127/jatbee32/WP_experiment/ofav/trinity
singularity exec -e /scratch1/07127/jatbee32/trinityrnaseq.v2.15.1.simg Trinity --normalize_reads --seqType fq --CPU 20 --max_memory 200G --SS_lib_type FR --left UVI_099_fp_1.fq.gz,UVI_103_fp_1.fq.gz,UVI_106_fp_1.fq.gz,UVI_108_fp_1.fq.gz --right UVI_099_fp_2.fq.gz,UVI_103_fp_2.fq.gz,UVI_106_fp_2.fq.gz,UVI_108_fp_2.fq.gz
```
### S.siderea
Three samples (one control, one exposed, and one diseased) are randomly selected and their FastP adapter-trimmed and quality-filtered reads are used to generate a de novo reference transcriptome using [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki). This job was run on the Frontera supercomputer at the Texas Advanced Computing Center (TACC).
Samples used in assembly:
UVI_131 (control)
UVI_134 (exposed)
UVI_136 (diseased)
The code in ssid_denovo_trinity.sh is as follows:
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J ssid_denovo_trinity # job name
#SBATCH -o ssid_denovo_trinity.o%j # output and error file name (%j expands to jobID)
#SBATCH -e ssid_denovo_trinity.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 32 # total number of mpi tasks requested
#SBATCH -p nvdimm # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH -A DEB22003 # charge job to the specified project
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
module load tacc-singularity
PATH=$PATH:/scratch1/06825/tg861249
singularity exec -e /scratch1/06825/tg861249/trinityrnaseq.v2.15.1.simg Trinity --normalize_reads --seqType fq --grid_node_CPU 2 --max_memory 200G --SS_lib_type FR --left UVI_131_fp_1.fq.gz,UVI_134_fp_1.fq.gz,UVI_136_fp_1.fq.gz --right UVI_131_fp_2.fq.gz,UVI_134_fp_2.fq.gz,UVI_136_fp_2.fq.gz
```
### O. annularis
All samples from both studies were selected and their FastP adapter-trimmed and quality-filtered reads were used to generate a de novo reference transcriptome using [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki). This job was run on the HPC at University of Texas at Arlington.
Samples used in assembly:
Past_c1 (SCTLD Control)
Past_c5 (SCTLD Control)
Past_c6 (SCTLD Control)
Past_c7 (SCTLD Control)
Past_d1 (SCTLD Disease)
Past_d2 (SCTLD Disease)
Past_d3 (SCTLD Disease)
Past_d4 (SCTLD Disease)
Past_d5 (SCTLD Disease)
Past_d6 (SCTLD Disease)
Past_d7 (SCTLD Disease)
USVI_109 (WP Control)
USVI_110 (WP Exposed)
USVI_111 (WP Control)
USVI_112 (WP Exposed)
USVI_117 (WP Control)
USVI_118 (WP Disease)
The code in oann_denovo_trinity.sh is as follows:
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J oann_denovo_trinity # job name
#SBATCH -o oann_denovo_trinity.o%j # output and error file name (%j expands to jobID)
#SBATCH -e oann_denovo_trinity.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 32 # total number of mpi tasks requested
#SBATCH -p long # queue (partition) -- normal, development, etc.
#SBATCH -t 8-00:00:00 # run time (dd-hh:mm:ss) - 8 days
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
PATH=$PATH:/home/ewb8793/trinityrnaseq.v2.15.1.simg
DIR=/home/ewb8793/coral_classification/oann/fastp
module load singularity
# move to bulk directory
cd /bulk/ewb8793/trinity/oann
singularity exec -e /home/ewb8793/trinityrnaseq.v2.15.1.simg Trinity --normalize_reads --seqType fq --CPU 32 --max_memory 200G --SS_lib_type FR --left $DIR/UVI_097_fp_1.fq.gz,$DIR/UVI_093_fp_1.fq.gz,$DIR/O_c2_fp_1.fq.gz,$DIR/O_c3_fp_1.fq.gz,$DIR/UVI_096_fp_1.fq.gz,$DIR/UVI_092_fp_1.fq.gz,$DIR/O_d6_fp_1.fq.gz,$DIR/O_d1_fp_1.fq.gz --right $DIR/UVI_097_fp_2.fq.gz,$DIR/UVI_093_fp_2.fq.gz,$DIR/O_c2_fp_2.fq.gz,$DIR/O_c3_fp_2.fq.gz,$DIR/UVI_096_fp_2.fq.gz,$DIR/UVI_092_fp_2.fq.gz,$DIR/O_d6_fp_2.fq.gz,$DIR/O_d1_fp_2.fq.gz
```
## Step 2: Obtain coral-only transcripts
In this step, Trinity outputs are moved into a new directory to begin extracting a coral-only transcriptome.
Coral-only transcripts are extracted from the de novo metatranscriptome following the protocol outlined by Dimos et al. (2022).
### Create One Master Coral Transcriptome
Gene models and de novo transcriptomes were reformatted and merged to create one master coral transcriptome. The species and their DOIs are listed below:
| Species | DOI |
| ----------- | ----------- |
| *Acropora millepora* | 10.1126/science.aba4674 |
| *Actinia equina* | 10.1016/j.margen.2020.100753 |
| *Astrangia poculata* | 10.1101/2023.09.22.558704 |
| *Dendronephthya gigantea* | 10.1093/gbe/evz043 |
| *Exaiptasia diaphana* | 10.1073/pnas.1513318112 |
| *Goniastrea aspera* | 10.1186/s13059-018-1552-8 |
| *Montipora capitata* | 10.1093/gigascience/giac098 |
| *Nematostella vectensis* | 10.1101/2020.10.30.359448 |
| *Orbicella faveolata* | 10.1016/j.cub.2016.09.039 |
| *Orbicella annularis* | 10.1038/s41467-023-38612-4 |
| *Pachyseris speciosa* | 10.1016/j.cub.2021.03.028 |
| *Plesiastrea versipora* | 10.1016/j.ympev.2019.01.012 |
| *Pocillopora acuta* | 10.1093/gbe/evad149 |
| *Porites astreoides* | 10.46471/gigabyte.65 |
| *Pseudodiploria strigosa* | 10.1038/s41467-023-38612-4 |
| *Siderastrea siderea* | 10.1111/mec.16414 |
| *Stylophora pistillata* | 10.1038/s41598-017-17484-x |
| *Xenia sp.* | 10.1038/s41586-020-2385-7 |
```{linux, eval=FALSE}
# reformat contig names
# Amil gene model
awk '/^>/{print ">amils_gene." ++i; next}{print}' amil_genemodel.fasta > Amil_genemodel.fasta
# Exaiptasia gene model
awk '/^>/{print ">aipt_gene." ++i; next}{print}' exaiptasia_genemodel.fasta > Exaiptasia_genemodel.fasta
# Past gene model
awk '/^>/{print ">past_gene." ++i; next}{print}' Pastreoides_transcripts_v1.fasta > Past_genemodel.fasta
# Pspe gene model
awk '/^>/{print ">pspe_gene." ++i; next}{print}' pspe_0.12.maker_post_002.transcripts.fasta > Pspe_genemodel.fasta
# Pocillopora acuta gene model
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/Pocillopora_acuta_HIv2.genes.cds.fna out=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/Pocillopora_acuta_genemodel_v1.fasta fixjunk
rm Pocillopora_acuta_HIv2.genes.cds.fna
awk '/^>/{print ">Pocacu_gene." ++i; next}{print}' Pocillopora_acuta_genemodel_v1.fasta > Pocillopora_acuta_genemodel.fasta
# Apoculata gene model
awk '/^>/{print ">Apoculata_gene." ++i; next}{print}' apoculata_longest_trans_cds_formatted.fasta > Apoculata_genemodel.fasta
# Nematostella gene model
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/NVEC_genemodel.fna out=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/NVEC_genemodel.fasta fixjunk
awk '/^>/{print ">Ncev_gene." ++i; next}{print}' NVEC_genemodel.fasta > Nvec_genemodel.fasta
rm NVEC_genemodel.fasta NVEC_genemodel.fna
# Ssid Brad transcriptome
awk '/^>/{print ">Ssid_gene." ++i; next}{print}' Ssid_revision_transcriptome.fa > Ssid_bradmodel.fasta
# Ofav gene model
awk '/^>/{print ">Ofav_gene." ++i; next}{print}' ofav_genemodel.fasta > Ofav_genemodel.fasta
# Aequina gene model
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/equina_smart.rnam-trna.merged.ggf.curated.remredun.nucl.fa out=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/equina_smart_genemodel_v1.fasta fixjunk
awk '/^>/{print ">Aequina." ++i; next}{print}' equina_smart_genemodel_v1.fasta > equina_smart_genemodel.fasta
rm equina_smart_genemodel_v1.fasta equina_smart.rnam-trna.merged.ggf.curated.remredun.nucl.fa
# Gasp gene model
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/gasp_1.0.transcripts.fasta.gz out=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/gasp_1.0.transcripts.fasta fixjunk
awk '/^>/{print ">gasp." ++i; next}{print}' gasp_1.0.transcripts.fasta > gasp_genemodel.fasta
rm gasp_1.0.transcripts.fasta.gz gasp_1.0.transcripts.fasta
# Montipora capitata gene model
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/Montipora_capitata_HIv3.genes.cds.fna.gz out=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/Montipora_capitata_HIv3.genes.cds.fasta fixjunk
awk '/^>/{print ">Mcap." ++i; next}{print}' Montipora_capitata_HIv3.genes.cds.fasta > Montipora_capitata_genemodel.fasta
rm Montipora_capitata_HIv3.genes.cds.fasta Montipora_capitata_HIv3.genes.cds.fna.gz
# Spis gene model
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/Spis.genome.annotation.CDS.longest.fa.gz out=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/Spis.genome.annotation.CDS.longest.fasta fixjunk
awk '/^>/{print ">Spis." ++i; next}{print}' Spis.genome.annotation.CDS.longest.fasta > Spis_genemodel.fasta
rm Spis.genome.annotation.CDS.longest.fa.gz Spis.genome.annotation.CDS.longest.fasta
# Xenia gene model
awk '/^>/{print ">Xenia." ++i; next}{print}' xenia_transcripts.fasta > xenia_genemodel.fasta
rm xenia_transcripts.fasta
# D_gigantea gene model
awk '/^>/{print ">Dgigantea." ++i; next}{print}' D_gigantea_transcripts.fasta > D_gigantea_genemodel.fasta
rm D_gigantea_transcripts.fasta
# Plesiastrea versipora gene model
awk '/^>/{print ">Plesiastrea.versipora." ++i; next}{print}' Plesiastrea_versipora_transcriptome_REL.fa > PPlesiastrea_versipora_transcriptome.fasta
rm Plesiastrea_versipora_transcriptome_REL.fa
# Pstr Kelsey transcriptome
awk '/^>/{print ">Pstr_gene." ++i; next}{print}' Pstr_reference_transcriptome.fa > Pstr_kelseymodel.fasta
# Oann Kelsey genome_guidedtranscriptome
awk '/^>/{print ">Oann_gene." ++i; next}{print}' Oann_reference_transcriptome.fa > Oann_kelseymodel.fasta
rm Oann_reference_transcriptome.fa
```
A Master Coral database is created using [BLAST](https://www.ncbi.nlm.nih.gov/books/NBK279690/toc/?report=reader):
```{bash, eval=FALSE}
#!/bin/bash
#SBATCH -J bbsplit # job name
#SBATCH -o bbsplit.o%j # output and error file name (%j expands to jobID)
#SBATCH -e bbsplit.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 6 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
PATH=$PATH:/home/ewb8793/ncbi-blast-2.14.1+/bin/
DIR=/home/ewb8793/coral_classification/transcriptomes/MasterCoral_proteomes
# concatenate the files together
cat Amil_genemodel.fasta gasp_genemodel.fasta Past_genemodel.fasta Spis_genemodel.fasta Apoculata_genemodel.fasta Pocillopora_acuta_genemodel.fasta Ssid_bradmodel.fasta D_gigantea_genemodel.fasta Montipora_capitata_genemodel.fasta PPlesiastrea_versipora_transcriptome.fasta xenia_genemodel.fasta equina_smart_genemodel.fasta Nvec_genemodel.fasta Pspe_genemodel.fasta Exaiptasia_genemodel.fasta Ofav_genemodel.fasta Pstr_kelseymodel.fasta > mastercoral_final.fasta
PATH=$PATH:/home/ewb8793/ncbi-blast-2.14.1+/bin/
/home/ewb8793/ncbi-blast-2.14.1+/bin/makeblastdb -in /home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/mastercoral_final.fasta -parse_seqids -dbtype nucl -out MasterCoral_db
```
### Extraction of Coral Only Transcripts
The longest isoform sequence is obtained using a utility script from [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki). The assembly is BLASTed against the Master Coral database. Finally, reads with less than 95% identity or shorter than 150 bp are filtered out. Some species required a more lenient identity threshold (90%), as noted in the methods of the corresponding paper.
The script, coral_only_transcripts.sh, is as follows:
```{bash, eval=FALSE}
#!/bin/bash
#SBATCH -J coral_only_transcripts # job name
#SBATCH -o coral_only_transcripts.o%j # output and error file name (%j expands to jobID)
#SBATCH -e coral_only_transcripts.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 6 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 24:00:00 # run time (hh:mm:ss) - 24 hours
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
PATH=$PATH:/home/ewb8793/trinityrnaseq-v2.15.1/util/misc/
PATH=$PATH:/home/ewb8793/ncbi-blast-2.14.1+/bin/
DIR=/home/ewb8793/coral_classification/transcriptomes/cnat
#move trinity outputs to transcriptome directory
#cd /home/ewb8793
#mv trinity_out_dir* /home/ewb8793/coral_classification/transcriptomes/past
PATH=$PATH:/home/ewb8793/trinityrnaseq-v2.15.1/util/misc/
PATH=$PATH:/home/ewb8793/ncbi-blast-2.14.1+/bin/
PATH=$PATH:/home/ewb8793/cdbfasta-master
PATH=$PATH:/home/ewb8793/anaconda3/opt/transdecoder
DIR=/home/ewb8793/coral_classification/transcriptomes/cnat
# obtain longest isoforms
/home/ewb8793/trinityrnaseq-v2.15.1/util/misc/get_longest_isoform_seq_per_trinity_gene.pl trinity_out_dir.Trinity.fasta > cnat_longest_isoform.fasta
grep ">" cnat_longest_isoform.fasta | wc -l
# blast longest isoforms against MasterCoral Database
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastn -query cnat_longest_isoform.fasta -db /home/ewb8793/coral_classification/transcriptomes/MasterCoral_genemodels/MasterCoral_db -outfmt "6 qseqid evalue pident length" -max_target_seqs 1 -out cnat_coral_only_sequences.txt
# count BLAST hits (tabular output has no ">" headers, so count lines directly)
wc -l < cnat_coral_only_sequences.txt
# filter reads with less than 95% identity & shorter than 150 bp
awk '{if ($3 > 95) print $1,$2,$4 }' cnat_coral_only_sequences.txt > contigs_percent_95.txt
wc -l < contigs_percent_95.txt
awk '{if ($3 > 150) print $1}' contigs_percent_95.txt > contigs_percent_95_bp_150.txt
wc -l < contigs_percent_95_bp_150.txt
# filter reads with less than 90% identity & shorter than 150 bp
awk '{if ($3 > 90) print $1,$2,$4 }' cnat_coral_only_sequences.txt > contigs_percent_90.txt
wc -l < contigs_percent_90.txt
awk '{if ($3 > 150) print $1}' contigs_percent_90.txt > contigs_percent_90_bp_150.txt
wc -l < contigs_percent_90_bp_150.txt
```
## Step 3 & 4: Extract coral-only transcripts from assembly & Make an alignable transcriptome
### Example Script
An index of the metatranscriptome assembly is created with [cdbfasta](https://github.com/gpertea/cdbfasta). Coral-only contigs (contigs_percent_95_bp_150.txt) are matched to the metatranscriptome index. The longest open reading frame is extracted from each contig and its predicted peptide sequence is generated using [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki). Similar sequences are removed with [cd-hit](https://sites.google.com/view/cd-hit).
The script, extract_and_align.sh, is as follows:
```{bash, eval=FALSE}
#!/bin/bash
#SBATCH -J extract_and_align # job name
#SBATCH -o extract_and_align.o%j # output and error file name (%j expands to jobID)
#SBATCH -e extract_and_align.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 6 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 24:00:00 # run time (hh:mm:ss) - 24 hours
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
# establish directories
PATH=$PATH:/home/ewb8793/trinityrnaseq-v2.15.1/util/misc/
PATH=$PATH:/home/ewb8793/ncbi-blast-2.14.1+/bin/
PATH=$PATH:/home/ewb8793/cdbfasta-master
PATH=$PATH:/home/ewb8793/anaconda3/opt/transdecoder
DIR=/home/ewb8793/coral_classification/transcriptomes/cnat
# index metatranscriptome assembly with cdbfasta
/home/ewb8793/cdbfasta-master/cdbfasta trinity_out_dir.Trinity.fasta
# match coral-only contigs to the metatranscriptome index
cat contigs_percent_95_bp_150.txt | /home/ewb8793/cdbfasta-master/cdbyank trinity_out_dir.Trinity.fasta.cidx > cnat_coral_only_transcriptome_95.fasta
cat contigs_percent_90_bp_150.txt | /home/ewb8793/cdbfasta-master/cdbyank trinity_out_dir.Trinity.fasta.cidx > cnat_coral_only_transcriptome_90.fasta
# Extract the longest open reading frame from each contig and then generate its predicted peptide sequence using TransDecoder
/home/ewb8793/miniconda3/bin/TransDecoder.LongOrfs -t cnat_coral_only_transcriptome_95.fasta
/home/ewb8793/miniconda3/bin/TransDecoder.LongOrfs -t cnat_coral_only_transcriptome_90.fasta
/home/ewb8793/miniconda3/bin/TransDecoder.Predict -t cnat_coral_only_transcriptome_95.fasta
/home/ewb8793/miniconda3/bin/TransDecoder.Predict -t cnat_coral_only_transcriptome_90.fasta
# Rename the resulting .pep file to end in .fa
cp cnat_coral_only_transcriptome_95.fasta.transdecoder.pep cnat_coral_only_transcriptome_95.fasta.transdecoder.fa
cp cnat_coral_only_transcriptome_90.fasta.transdecoder.pep cnat_coral_only_transcriptome_90.fasta.transdecoder.fa
# Similar sequences are removed with cd-hit
/home/ewb8793/miniconda3/bin/cd-hit -i cnat_coral_only_transcriptome_95.fasta.transdecoder.fa -o cnat_reference_proteome_95.fa
/home/ewb8793/miniconda3/bin/cd-hit -i cnat_coral_only_transcriptome_90.fasta.transdecoder.fa -o cnat_reference_proteome_90.fa
grep ">" cnat_reference_proteome_95.fa | wc -l
grep ">" cnat_reference_proteome_90.fa | wc -l
# make align-able transcriptome
grep ">" cnat_reference_proteome_95.fa > cnat_proteome_names_95.txt
sed 's/.p/\t/' cnat_proteome_names_95.txt > proteome_names_format_95.txt
awk '{print $1}' proteome_names_format_95.txt > contigs_to_extract_95.txt
sed 's/^.//g' contigs_to_extract_95.txt > contigs_list_95.txt
cat contigs_list_95.txt | /home/ewb8793/cdbfasta-master/cdbyank trinity_out_dir.Trinity.fasta.cidx > final_cnat_reference_transcriptome_95.fa
grep ">" final_cnat_reference_transcriptome_95.fa | wc -l
grep ">" cnat_reference_proteome_90.fa > cnat_proteome_names_90.txt
sed 's/.p/\t/' cnat_proteome_names_90.txt > proteome_names_format_90.txt
awk '{print $1}' proteome_names_format_90.txt > contigs_to_extract_90.txt
sed 's/^.//g' contigs_to_extract_90.txt > contigs_list_90.txt
cat contigs_list_90.txt | /home/ewb8793/cdbfasta-master/cdbyank trinity_out_dir.Trinity.fasta.cidx > final_cnat_reference_transcriptome_90.fa
grep ">" final_cnat_reference_transcriptome_90.fa | wc -l
```
## Steps 5, 6, and 7: BUSCO, Annotation, and Assembly Stats
Assess the quality of the new transcriptome with BUSCO, annotate it against the UniProt database, and compute assembly statistics with a BBMap tool.
### Create Uniprot Database
Download the latest [UniProt](https://www.uniprot.org/help/downloads) release (Reviewed).
```{bash, eval=FALSE}
# Downloaded on June 28th 2023
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
```
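As a quick sanity check after downloading (not part of the original pipeline; the exact count varies by UniProt release), the number of protein records can be verified by counting FASTA header lines:

```{bash, eval=FALSE}
# each UniProt record starts with a ">" header line;
# the guard skips the check if the download has not been run yet
if [ -f uniprot_sprot.fasta ]; then
  grep -c ">" uniprot_sprot.fasta
fi
```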
Create the BLAST database out of this Uniprot download:
```{bash, eval=FALSE}
DIR=/home/ewb8793/coral_classification/transcriptomes/uniprotdb
/home/ewb8793/ncbi-blast-2.14.1+/bin/makeblastdb -in uniprot_sprot.fasta -parse_seqids -dbtype prot -out uniprot_db
```
### Example Script
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J annotate_stats # job name
#SBATCH -o annotate_stats.o%j # output and error file name (%j expands to jobID)
#SBATCH -e annotate_stats.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 26 # total number of mpi tasks requested
#SBATCH -p normal # queue (partition) -- normal, development, etc.
#SBATCH -t 3-00:00:00 # run time (dd-hh:mm:ss) - 3 days
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
module load singularity
# BUSCO Score
singularity exec docker://ezlabgva/busco:v5.5.0_cv1 busco -i final_mcav_reference_transcriptome_90.fa -o final_mcav90_reference_transcriptome_busco_results -l metazoa_odb10 -m tran -f -c 10
# Annotate Transcriptome
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastx -query final_ppor_reference_transcriptome.fa -db /home/ewb8793/coral_classification/transcriptomes/uniprotdb/uniprot_db -outfmt "6 sseqid qseqid evalue" -max_target_seqs 1 -num_threads 15 -out annotated_final_ppor_reference_transcriptome.txt
# Assembly Stats
#/home/ewb8793/anaconda3/bin/java -cp /home/ewb8793/bbmap/current/ jgi.AssemblyStats2 in=final_cnat_reference_transcriptome_95.fa > Assembly_stats_95.txt
#/home/ewb8793/anaconda3/bin/java -cp /home/ewb8793/bbmap/current/ jgi.AssemblyStats2 in=final_cnat_reference_transcriptome_90.fa > Assembly_stats_90.txt
```
### Annotations for Previously Published Transcriptomes
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J annotate_stats # job name
#SBATCH -o annotate_stats.o%j # output and error file name (%j expands to jobID)
#SBATCH -e annotate_stats.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 26 # total number of mpi tasks requested
#SBATCH -p normal # queue (partition) -- normal, development, etc.
#SBATCH -t 3-00:00:00 # run time (dd-hh:mm:ss) - 3 days
#SBATCH [email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
PATH=$PATH:/home/ewb8793/ncbi-blast-2.14.1+/bin/
# Past Putnam
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastx -query ../Past_genome_putnam.fa -db /home/ewb8793/coral_classification/transcriptomes/uniprotdb/uniprot_db -outfmt "6 sseqid qseqid evalue" -max_target_seqs 1 -num_threads 15 -out annotated_past_gene_model.txt
# Pstr
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastx -query ../Pstr_reference_transcriptome.fa -db /home/ewb8793/coral_classification/transcriptomes/uniprotdb/uniprot_db -outfmt "6 sseqid qseqid evalue" -max_target_seqs 1 -num_threads 15 -out annotated_pstr_ref_transcriptome.txt
# A
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastx -query ../SymbA_transcriptome.fa -db /home/ewb8793/coral_classification/transcriptomes/uniprotdb/uniprot_db -outfmt "6 sseqid qseqid evalue" -max_target_seqs 1 -num_threads 15 -out annotated_symbA_ref_transcriptome.txt
# B
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastx -query ../SymbB_transcriptome.fa -db /home/ewb8793/coral_classification/transcriptomes/uniprotdb/uniprot_db -outfmt "6 sseqid qseqid evalue" -max_target_seqs 1 -num_threads 15 -out annotated_symbB_ref_transcriptome.txt
# C
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastx -query ../SymbC_transcriptome.fa -db /home/ewb8793/coral_classification/transcriptomes/uniprotdb/uniprot_db -outfmt "6 sseqid qseqid evalue" -max_target_seqs 1 -num_threads 15 -out annotated_symbC_ref_transcriptome.txt
# D
/home/ewb8793/ncbi-blast-2.14.1+/bin/blastx -query ../SymbD_transcriptome.fa -db /home/ewb8793/coral_classification/transcriptomes/uniprotdb/uniprot_db -outfmt "6 sseqid qseqid evalue" -max_target_seqs 1 -num_threads 15 -out annotated_symbD_ref_transcriptome.txt
```
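Even with `-max_target_seqs 1`, the tab-separated `-outfmt 6` tables above can contain more than one HSP row per query. A minimal sketch for collapsing them to a single best hit per query, assuming the three-column `sseqid qseqid evalue` layout used above (the `demo_*` filenames are illustrative, not outputs of the commands above):

```{bash, eval=FALSE}
# Demo annotation table in the documented layout: sseqid  qseqid  evalue
cat > demo_hits.txt <<'EOF'
sp|P12345|HSP70_HUMAN	contig_1	1e-20
sp|P99999|CYC_HUMAN	contig_1	2e-04
sp|Q8WZ42|TITIN_HUMAN	contig_2	3e-09
EOF
# Sort by query, then by e-value ascending (-g handles scientific notation),
# and keep the first (lowest-e-value) row per query.
sort -k2,2 -k3,3g demo_hits.txt | awk '!seen[$2]++' > demo_best_hits.txt
cat demo_best_hits.txt
```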
# BBSplit: Sort coral and Symbiodiniaceae reads
Our eukaryotic reads contain sequences originating from both the coral host species and their intracellular Symbiodiniaceae. Four Symbiodiniaceae genera predominantly form symbioses with the corals in our study, so before read quantification we use [BBMap](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/) to map reads against the coral host transcriptome as well as the Symbiodinium, Breviolum, Cladocopium, and Durusdinium transcriptomes. BBMap is a splice-aware global aligner for DNA and RNA sequencing reads, and BBSplit uses BBMap to map reads against multiple references at once, assigning each read to the reference it matches best.
BBSplit produces one output file per reference for each sample. The refstats text file can be downloaded and collated into the BBSplit statistics output for all species; it is used below for the Summary of Genus Averages.
Each output file is interleaved, so it must then be reformatted with BBMap's reformat tool to split it back into separate forward and reverse fastq files.
Usage is as follows:
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J bbsplit # job name
#SBATCH -o bbsplit.o%j # output and error file name (%j expands to jobID)
#SBATCH -e bbsplit.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 6 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
PATH=$PATH:/home/ewb8793/bbmap/
PATH=$PATH:/home/ewb8793/anaconda3/bin
DIR=/home/ewb8793/coral_classification/transcriptomes/final_transcriptomes
cd /home/ewb8793/coral_classification/past/bbsplit
# BBSplit for known transcriptomes
for FILE in *_fp_1.fq.gz; do
echo ${FILE}
SAMP=$(basename -s _fp_1.fq.gz $FILE)
echo $SAMP
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ align2.BBSplitter in1=${SAMP}_fp_1.fq.gz in2=${SAMP}_fp_2.fq.gz ref=${DIR}/Past_genome_putnam.fa,${DIR}/SymbA_transcriptome.fa,${DIR}/SymbB_transcriptome.fa,${DIR}/SymbC_transcriptome.fa,${DIR}/SymbD_transcriptome.fa basename=${SAMP}_%.fq.gz refstats=${SAMP}_stats.txt outu1=${SAMP}_bboutu_1.fq.gz outu2=${SAMP}_bboutu_2.fq.gz
done
# reformat output fastq files host
for FILE in *_coral_reference_transcriptome.fq.gz; do
echo ${FILE}
SAMP=$(basename -s _coral_reference_transcriptome.fq.gz $FILE)
echo $SAMP
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=${SAMP}_coral_reference_transcriptome.fq.gz out1=${SAMP}_host_1.fq out2=${SAMP}_host_2.fq
done
# reformat output fastq files SymbA
for FILE in *SymbA_transcriptome.fq.gz; do
echo ${FILE}
SAMP=$(basename -s _SymbA_transcriptome.fq.gz $FILE)
echo $SAMP
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=${SAMP}_SymbA_transcriptome.fq.gz out1=${SAMP}_A_1.fq out2=${SAMP}_A_2.fq
done
# reformat output fastq files SymbB
for FILE in *_SymbB_transcriptome.fq.gz; do
echo ${FILE}
SAMP=$(basename -s _SymbB_transcriptome.fq.gz $FILE)
echo $SAMP
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=${SAMP}_SymbB_transcriptome.fq.gz out1=${SAMP}_B_1.fq out2=${SAMP}_B_2.fq
done
# reformat output fastq files SymbC
for FILE in *_SymbC_transcriptome.fq.gz; do
echo ${FILE}
SAMP=$(basename -s _SymbC_transcriptome.fq.gz $FILE)
echo $SAMP
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=${SAMP}_SymbC_transcriptome.fq.gz out1=${SAMP}_C_1.fq out2=${SAMP}_C_2.fq
done
# reformat output fastq files SymbD
for FILE in *_SymbD_transcriptome.fq.gz; do
echo ${FILE}
SAMP=$(basename -s _SymbD_transcriptome.fq.gz $FILE)
echo $SAMP
/home/ewb8793/anaconda3/bin/java -ea -Xmx10g -cp /home/ewb8793/bbmap/current/ jgi.ReformatReads in=${SAMP}_SymbD_transcriptome.fq.gz out1=${SAMP}_D_1.fq out2=${SAMP}_D_2.fq
done
```
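The per-sample `*_stats.txt` files written by `refstats=` can be collated into one table for the genus-average summary. A sketch under the assumption that BBSplit's refstats layout puts the reference name in column 1 and `%unambiguousReads` in column 2 after a header line (verify against your own files; the demo file below is fabricated):

```{bash, eval=FALSE}
# Fabricated demo refstats file in the assumed layout
cat > Pstr_c2_stats.txt <<'EOF'
#name	%unambiguousReads	unambiguousMB	%ambiguousReads	ambiguousMB
Past_genome_putnam	62.105	1200.5	3.002	58.1
SymbC_transcriptome	20.441	400.2	2.115	40.9
EOF
# One row per (sample, reference): percent of reads mapped unambiguously
for STATS in *_stats.txt; do
  SAMP=$(basename -s _stats.txt "$STATS")
  awk -v s="$SAMP" 'NR > 1 { printf "%s\t%s\t%s\n", s, $1, $2 }' "$STATS"
done > refstats_summary.txt
cat refstats_summary.txt
```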
# Salmon: Read Quantification
[Salmon](https://salmon.readthedocs.io/en/latest/salmon.html#) is a tool built for transcript quantification. It works in two phases, indexing and quantification. The first phase, indexing, is independent of the reads and requires a reference transcriptome to build an index. Code for that is as follows:
## Index Building for Host Transcriptomes
For the host indexes, we keep the default k-mer size (k = 31), as we are confident in the quality of the transcriptomes we have just built.
```{linux, eval=FALSE}
PATH=$PATH:/home/ewb8793/salmon-latest_linux_x86_64/bin
DIR=/home/ewb8793/coral_classification/transcriptomes/salmon_index
# Mcav index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/mcav_ref_transcriptome.fa -i Mcav_index
# Ssid index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/ssid_ref_transcriptome.fa -i Ssid_index
# Ofav index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/ofav_ref_transcriptome.fa -i Ofav_index
# Cnat index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/cnat_ref_transcriptome.fa -i Cnat_index
#Pstr
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/Pstr_reference_transcriptome.fa -i Pstr_index
#Past
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/Past_reference_transcriptome.fa -i Past_index
#Oann
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/oann_ref_transcriptome.fa -i Oann_index
```
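Before indexing, a quick sanity check is to count contigs in each transcriptome and compare against the assembly statistics reported earlier. A sketch (the `demo_*` file is a fabricated stand-in for a real transcriptome):

```{bash, eval=FALSE}
# Fabricated two-contig transcriptome for illustration
cat > demo_reference_transcriptome.fa <<'EOF'
>contig_1
ATGCGTACGT
>contig_2
TTGACAGGCA
EOF
# FASTA headers start with '>', so counting them counts contigs
for FA in *transcriptome.fa; do
  printf '%s\t%d\n' "$FA" "$(grep -c '^>' "$FA")"
done
```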
## Index Building for Symbiont Transcriptomes
For the symbiont indexes, we lower the k-mer size to 23 to increase mapping sensitivity, since our reads may be more divergent from these reference transcriptomes.
```{linux, eval=FALSE}
PATH=$PATH:/home/ewb8793/salmon-latest_linux_x86_64/bin
DIR=/home/ewb8793/coral_classification/transcriptomes/salmon_index
# Durusdinium index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/SymbD_transcriptome.fa -i Clade_D_index -k 23
# Cladocopium index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/SymbC_transcriptome.fa -i Clade_C_index -k 23
# Breviolum index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/SymbB_transcriptome.fa -i Clade_B_index -k 23
# Symbiodinium index
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon index -t ../final_transcriptomes/SymbA_transcriptome.fa -i Clade_A_index -k 23
```
## Mapping Reads: Use Salmon for quasi-mapping results
The second phase, quantification, maps each sample's reads to the index using [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html#)'s quasi-mapping procedure. Quasi-mapping is designed for references that are comparatively small and repeat-rich, and because the index is built from a transcriptome rather than a genome, splicing is handled implicitly.
#### Host Salmon Loop
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J salmon_quant_host # job name
#SBATCH -o salmon_quant_host.o%j # output and error file name (%j expands to jobID)
#SBATCH -e salmon_quant_host.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 8 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
#SBATCH --array=0-6       # array size must match your number of samples (e.g. 7 samples -> indices 0-6)
PATH=$PATH:/home/ewb8793/salmon-latest_linux_x86_64/bin
cd /home/ewb8793/coral_classification/past/salmon/host
mkdir -p quants
fname="sample_names.txt" # text file with just beginning of names ie Pstr_c2 etc
# each array task quantifies one sample: pick the line of the sample list
# matching this task's index (array indices start at 0)
p=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "$fname")
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon quant -i /home/ewb8793/coral_classification/transcriptomes/salmon_index/Past_index -l A \
         -1 ${p}_host_1.fq \
         -2 ${p}_host_2.fq \
         -p 8 --validateMappings -o quants/${p}_quant
```
#### Symbiodinium Salmon Loop
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J salmon_quant_A # job name
#SBATCH -o salmon_quant_A.o%j # output and error file name (%j expands to jobID)
#SBATCH -e salmon_quant_A.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 8 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
#SBATCH --array=0-6
PATH=$PATH:/home/ewb8793/salmon-latest_linux_x86_64/bin
cd /home/ewb8793/coral_classification/pstr/salmon/A
mkdir -p quants
fname="sample_names.txt"
# each array task quantifies one sample (array indices start at 0)
p=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "$fname")
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon quant -i /home/ewb8793/coral_classification/transcriptomes/salmon_index/Clade_A_index -l A \
         -1 ${p}_A_1.fq \
         -2 ${p}_A_2.fq \
         -p 8 --validateMappings -o quants/${p}_quant
```
#### Breviolum Salmon Loop
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J salmon_quant_B # job name
#SBATCH -o salmon_quant_B.o%j # output and error file name (%j expands to jobID)
#SBATCH -e salmon_quant_B.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 8 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
#SBATCH --array=0-6
PATH=$PATH:/home/ewb8793/salmon-latest_linux_x86_64/bin
cd /home/ewb8793/coral_classification/pstr/salmon/B
mkdir -p quants
fname="sample_names.txt"
# each array task quantifies one sample (array indices start at 0)
p=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "$fname")
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon quant -i /home/ewb8793/coral_classification/transcriptomes/salmon_index/Clade_B_index -l A \
         -1 ${p}_B_1.fq \
         -2 ${p}_B_2.fq \
         -p 8 --validateMappings -o quants/${p}_quant
```
#### Cladocopium Salmon Loop
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J salmon_quant_C # job name
#SBATCH -o salmon_quant_C.o%j # output and error file name (%j expands to jobID)
#SBATCH -e salmon_quant_C.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 8 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
#SBATCH --array=0-6
PATH=$PATH:/home/ewb8793/salmon-latest_linux_x86_64/bin
cd /home/ewb8793/coral_classification/pstr/salmon/C
mkdir -p quants
fname="sample_names.txt"
# each array task quantifies one sample (array indices start at 0)
p=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "$fname")
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon quant -i /home/ewb8793/coral_classification/transcriptomes/salmon_index/Clade_C_index -l A \
         -1 ${p}_C_1.fq \
         -2 ${p}_C_2.fq \
         -p 8 --validateMappings -o quants/${p}_quant
```
#### Durusdinium Salmon Loop
```{linux, eval=FALSE}
#!/bin/bash
#SBATCH -J salmon_quant_D # job name
#SBATCH -o salmon_quant_D.o%j # output and error file name (%j expands to jobID)
#SBATCH -e salmon_quant_D.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 8 # total number of mpi tasks requested
#SBATCH -p short # queue (partition) -- normal, development, etc.
#SBATCH -t 48:00:00 # run time (hh:mm:ss) - 48 hours
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
#SBATCH --array=0-6
PATH=$PATH:/home/ewb8793/salmon-latest_linux_x86_64/bin
cd /home/ewb8793/coral_classification/pstr/salmon/D
mkdir -p quants
fname="sample_names.txt"
# each array task quantifies one sample (array indices start at 0)
p=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "$fname")
/home/ewb8793/salmon-latest_linux_x86_64/bin/salmon quant -i /home/ewb8793/coral_classification/transcriptomes/salmon_index/Clade_D_index -l A \
         -1 ${p}_D_1.fq \
         -2 ${p}_D_2.fq \
         -p 8 --validateMappings -o quants/${p}_quant
```
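After the quantification jobs finish, per-sample mapping rates can be pulled from Salmon's log files. A sketch assuming Salmon's usual `quants/<sample>_quant/logs/salmon_quant.log` layout, with a fabricated log standing in for real output:

```{bash, eval=FALSE}
# Fabricated log in the assumed salmon output layout
mkdir -p quants/Pstr_c2_quant/logs
echo '[info] Mapping rate = 85.31%' > quants/Pstr_c2_quant/logs/salmon_quant.log
# One row per sample: name and reported mapping rate
for LOG in quants/*_quant/logs/salmon_quant.log; do
  SAMP=$(basename "$(dirname "$(dirname "$LOG")")" _quant)
  RATE=$(grep -o 'Mapping rate = [0-9.]*%' "$LOG")
  printf '%s\t%s\n' "$SAMP" "$RATE"
done > mapping_rates.txt
cat mapping_rates.txt
```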
# HMMER - Pfam Annotation of Contigs
All proteomes for the transcriptomes used or generated in this study were annotated against Pfam with [HMMER](http://hmmer.org/). From the manual, "HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs)." Contig annotations were filtered for e-value < 0.00001; annotating with Pfam domains provides functional evidence for the annotated sequences.
```{bash, eval=FALSE}
#!/bin/bash
#SBATCH -J hmmer_annotate # job name
#SBATCH -o hmmer_annotate.o%j # output and error file name (%j expands to jobID)
#SBATCH -e hmmer_annotate.e%j # name of stderr error file.
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 15 # total number of mpi tasks requested
#SBATCH -p normal # queue (partition) -- normal, development, etc.
#SBATCH -t 4-00:00:00          # run time (d-hh:mm:ss) - 4 days
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=begin # email me when the job starts
#SBATCH --mail-type=end # email me when the job finishes
PATH=$PATH:/home/ewb8793/hmmer-3.4/src
for FILE in ../*.fa; do
echo ${FILE}
SAMP=$(basename -s .fa $FILE)
echo $SAMP
# one hmmsearch pass per proteome, writing all three table formats at once
/home/ewb8793/hmmer-3.4/src/hmmsearch --tblout ${SAMP}_tblout.txt --domtblout ${SAMP}_domtblout.txt --pfamtblout ${SAMP}_pfamtblout.txt --incE 0.00001 --cpu 10 /home/ewb8793/hmmer-3.4/Pfam-A.hmm.gz ../${SAMP}.fa
done
```
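The e-value filter described above can be applied to the `--tblout` tables with `awk`, assuming HMMER's standard tblout layout in which column 5 is the full-sequence E-value and comment lines start with `#` (the demo rows below are fabricated):

```{bash, eval=FALSE}
# Fabricated rows in hmmsearch --tblout layout (column 5 = full-seq E-value)
cat > demo_tblout.txt <<'EOF'
# target name        accession  query name           accession    E-value  score  bias
contig_1             -          HSP70                PF00012.21   2.3e-40  140.1   0.0
contig_2             -          Cytochrom_C          PF00034.23   4.1e-03    9.8   0.1
EOF
# Skip comment lines; keep rows with full-sequence E-value < 1e-5
awk '!/^#/ && $5 + 0 < 1e-5' demo_tblout.txt > demo_tblout_filtered.txt
cat demo_tblout_filtered.txt
```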