Update lab note for RNA-seq data processing

Bix4UMD · Oct 3, 2024 · f38a101
1 parent 669d918
commit f38a101
Showing 1 changed file with 87 additions and 0 deletions.
diff --git a/docs/bulkRNAseq_lab.ipynb b/docs/bulkRNAseq_lab.ipynb
@@ -771,6 +771,93 @@
     "In the command line above, you again run `multiqc` in a Singularity container. This time, `-B $PWD` is used. `$PWD` is an environment variable that holds the current working directory, where the input and output of `multiqc` are stored."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8788ed49",
+   "metadata": {},
+   "source": [
+    "## Use RSeQC to generate QC plots"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "08f641c8",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "%%bash\n",
+    "cd /scratch/zt1/project/bioi611/user/$USER\n",
+    "sbatch ../../shared/scripts/bulkRNA_SE_s6_RSeQC_genebody_cov.sub\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "91fb378d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "#!/bin/bash\n",
+      "#SBATCH --partition=standard\n",
+      "#SBATCH -t 40:00:00\n",
+      "#SBATCH --nodes=1\n",
+      "#SBATCH --ntasks=1\n",
+      "#SBATCH --cpus-per-task=1\n",
+      "#SBATCH --job-name=bulkRNA_SE_s6_RSeQC_genebody_cov.sub\n",
+      "#SBATCH --mail-type=FAIL,BEGIN,END\n",
+      "#SBATCH --error=%x-%J-%u.err\n",
+      "#SBATCH --output=%x-%J-%u.out\n",
+      "\n",
+      "module load singularity\n",
+      "\n",
+      "## Binding path and singularity image file\n",
+      "SIF_BIND=\"/scratch/zt1/project/bioi611/\"\n",
+      "SIF_TRIMGALORE=\"/scratch/zt1/project/bioi611/shared/software/rseqc_v5.0.3.sif\"\n",
+      "SIF_BEDOPS=\"/scratch/zt1/project/bioi611/shared/software/bedops_v2.4.39.sif\"\n",
+      "## Paths to working directory and input fastq files\n",
+      "WORKDIR=\"/scratch/zt1/project/bioi611/user/$USER\"\n",
+      "\n",
+      "cd $WORKDIR\n",
+      "\n",
+      "\n",
+      "mkdir -p bulk_RNAseq_SE_RSeQC/\n",
+      "singularity exec -B $SIF_BIND $SIF_TRIMGALORE geneBody_coverage.py -r /scratch/zt1/project/bioi611/shared/reference/Caenorhabditis_elegans.WBcel235.111.bed -i bulkRNA_SE_STAR_align/N2_day1_rep1.Aligned.sortedByCoord.out.bam,bulkRNA_SE_STAR_align/N2_day7_rep1.Aligned.sortedByCoord.out.bam -o bulk_RNAseq_SE_RSeQC/geneBody_cov\n",
+      "\n",
+      "\n",
+      "# Test command line which can be completed in less than 2 minutes \n",
+      "# singularity exec -B $SIF_BIND $SIF_TRIMGALORE geneBody_coverage.py -r test_1000genes.bed -i bulkRNA_SE_STAR_align/N2_day1_rep1.Aligned.sortedByCoord.out.bam -o test_genebody_cov/test\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%bash\n",
+    "cd /scratch/zt1/project/bioi611/user/$USER\n",
+    "cat ../../shared/scripts/bulkRNA_SE_s6_RSeQC_genebody_cov.sub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d003be4c",
+   "metadata": {},
+   "source": [
+    "`Caenorhabditis_elegans.WBcel235.111.bed` is one of the inputs to `geneBody_coverage.py` in RSeQC. To understand the BED file format, please refer to the link below:\n",
+    "\n",
+    "https://genome.ucsc.edu/FAQ/FAQformat.html#format1\n",
+    "\n",
+    "The BED file can be generated from a GFF3 file. GFF3 is a format similar to GTF. To generate the BED file from the GFF3 file, you can use the commands below:\n",
+    "\n",
+    "```\n",
+    "wget https://ftp.ensembl.org/pub/release-111/gff3/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.111.gff3.gz && gunzip Caenorhabditis_elegans.WBcel235.111.gff3.gz\n",
+    "export PATH=\"/scratch/zt1/project/bioi611/shared/software:$PATH\"\n",
+    "gff3ToGenePred Caenorhabditis_elegans.WBcel235.111.gff3 Caenorhabditis_elegans.WBcel235.111.genePred\n",
+    "genePredToBed Caenorhabditis_elegans.WBcel235.111.genePred Caenorhabditis_elegans.WBcel235.111.bed\n",
+    "```"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "5b4d74ef",