---
Steps to start:
(1) Clone the repo
- git clone https://github.com/lmart999/FASTCLIP
(2) Obtain and put a repeat masker file in ~/docs/
- This can be obtained as explained here: http://fantom.gsc.riken.jp/zenbu/wiki/index.php/Uploading_UCSC_repetitive_elements_track
- This is not included in the repo because the file is large.
- The code will, by default, look for a file with name: ~/docs/repeat_masker.bed
- Simply rename your file to repeat_masker.bed (or modify the code to point to your file name).
(3) Obtain and put the hg19 bowtie2 index in ~/docs/hg19/*
- Again, this is not included in the repo because the files are large.
- Files can be obtained here: wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19.zip
- Format will be hg19.1.bt2, hg19.3.bt2, etc.
(4) Create a /rawdata directory within the parent (~/rawdata/)
- Move paired raw iCLIP .fastq files to /rawdata
- Un-zip and use name convention: <name>_R1.fastq, <name>_R2.fastq.
(5) Create result file directory (~/results/<name>/).
- Output files for <name>_R1/2.fastq will be sent to ~/results/<name>/
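The layout described in the steps above can be checked before a run. This is a minimal sketch, not part of the repo: the function name missing_inputs is hypothetical, and it assumes that "~" in the paths above refers to the cloned repo directory, passed in here as root.

```python
import os

def missing_inputs(root, name):
    """Return the expected input paths that do not exist under `root`.

    Assumption: `root` is the cloned repo directory, written as "~" above.
    """
    expected = [
        os.path.join(root, "docs", "repeat_masker.bed"),
        os.path.join(root, "docs", "hg19"),
        os.path.join(root, "rawdata", "{}_R1.fastq".format(name)),
        os.path.join(root, "rawdata", "{}_R2.fastq".format(name)),
        os.path.join(root, "results", name),
    ]
    return [path for path in expected if not os.path.exists(path)]
```

An empty list means every expected file and directory is in place for <name>.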
---
Dependencies:
(1) Python 2.7 (for CLIPper algorithm)
- https://www.python.org/download/releases/2.7/
(2) iPython
- http://ipython.org/install.html
(3) iPython notebook (for using the various notebooks provided)
- http://ipython.org/notebook
(4) Matplotlib (plotting)
- http://matplotlib.org/
(5) Pandas (data)
- http://pandas.pydata.org/
(6) Bowtie and bedTools
- http://bowtie-bio.sourceforge.net/index.shtml
- http://bedtools.readthedocs.org/en/latest/
(7) CLIPPER
- https://github.com/YeoLab/clipper/wiki/CLIPper-Home
---
Usage:
(1) Run the main pipeline:
$ ipython fastclip.py <name>
(2) The plots will, by default, be automatically generated.
(3) The full fastclip pipeline is also provided as a notebook for manual analysis: Fastclip.ipynb.
(4) Several additional notebooks are included; see *.ipynb.
---
Pipeline steps explained:
(1) Get unzipped reads from the /rawdata directory.
- Format is <name>_R<1 or 2>.fastq.
(2) Trim the adapter from the 3' end of the reads.
- RT primer is cleaved, leaving adapters.
- Remove adapter region from the 3' end of the read.
- The adapter is an input parameter.
- Default is to remove sequences shorter than N=33 nucleotides.
- Q33 specifies the quality score encoding format.
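As an illustration of the trimming step, here is a minimal sketch that clips an exact adapter match and discards reads that end up too short. It is not the tool the pipeline calls: the function names are hypothetical, and real trimmers also handle mismatches and partial adapters at the read end. The second helper shows what Q33 encoding means (quality = ASCII code minus 33).

```python
def trim_3prime_adapter(seq, qual, adapter, min_len=33):
    """Clip `adapter` and everything 3' of it; return None if too short.

    Assumption: exact adapter match only (hypothetical simplification).
    """
    pos = seq.find(adapter)
    if pos != -1:
        seq, qual = seq[:pos], qual[:pos]
    if len(seq) < min_len:
        return None
    return seq, qual

def phred33(qual):
    """Decode a Q33-encoded quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in qual]
```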
(3) Quality filter.
(4) Remove duplicates.
- This step takes advantage of the fact that 5' end of each read has a random barcode.
- Each initial starting molecule that was RT'd will have a unique barcode.
- Therefore, PCR duplicates are removed by collapsing molecules with identical 5' barcode sequences.
(5) After duplicate removal, remove the 5' barcode sequence.
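Steps (4) and (5) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: reads are treated as duplicates when the entire sequence (5' barcode plus insert) is identical, and barcode_len is a hypothetical parameter because the barcode length is not given in this README.

```python
def collapse_duplicates(seqs):
    """Keep the first occurrence of each distinct sequence, preserving order.

    Assumption: a PCR duplicate is a read whose full sequence,
    including the 5' random barcode, matches an earlier read.
    """
    seen = set()
    unique = []
    for seq in seqs:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique

def strip_barcode(seq, barcode_len):
    """Remove the first `barcode_len` bases (the 5' barcode) from a read."""
    return seq[barcode_len:]
```

Collapsing must happen before the barcode is stripped; otherwise distinct molecules with the same insert would be merged.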
(6) We then map to a repeat index.
- We use k=1, meaning bt2 will search for at most 1 distinct, valid alignment for each read.
- This step allows us to both remove reads that are normally blacklisted and also map reads to the repeat index.
- *** The repeat index is derived from < ... >. ***
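A bowtie2 call of this kind could be assembled as below. This is a sketch only: the options the pipeline actually passes are not listed in this README, and the function name and all file names are hypothetical. The flags used (-p, -k, -x, -U, --un, -S) are standard bowtie2 options; --un writes reads that fail to align, which is what step (9) consumes.

```python
def bowtie2_repeat_cmd(index_prefix, reads_fastq, out_sam, unmapped_fastq, threads=1):
    """Build a bowtie2 command that reports at most one alignment per read."""
    return [
        "bowtie2",
        "-p", str(threads),      # number of alignment threads
        "-k", "1",               # search for at most 1 valid alignment
        "-x", index_prefix,      # basename of the bowtie2 index
        "-U", reads_fastq,       # unpaired input reads
        "--un", unmapped_fastq,  # reads that did not align
        "-S", out_sam,           # SAM output
    ]
```

The list can be passed to subprocess.check_call once bowtie2 is on the $PATH.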
(7) After mapping, we isolate the 5' position (RT stop) for both positive and negative strand reads.
- This represents the cross-link site in the initial experiment.
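For a read stored as a BED interval (0-based start, exclusive end), the 5' position depends on the strand. The sketch below assumes that convention; whether the pipeline applies any additional one-nucleotide offset is not stated here, so none is applied. The function name is hypothetical.

```python
def rt_stop(chrom, start, end, strand):
    """Return the 1-nt BED interval at the 5' end of a mapped read."""
    if strand == "+":
        pos = start        # 5' end is the leftmost base
    elif strand == "-":
        pos = end - 1      # 5' end is the rightmost base
    else:
        raise ValueError("strand must be '+' or '-'")
    return (chrom, pos, pos + 1, strand)
```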
(8) For each strand, we merge RT stops between replicates.
- This means that an RT stop position must be conserved between replicates.
- If conserved, we count the total number of instances of the RT position for both replicates.
- If the total counts exceed a specified threshold, then we record these RT stops.
- Finally, we re-generate a "read" around the RT stop using the passed parameter "expand".
- A "read" around the RT stop is required for downstream processing.
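The merge in step (8) can be sketched like this. It is an illustration, not the pipeline's implementation: the function name is hypothetical, "exceed" is read as strictly greater than the threshold, and the regenerated "read" is assumed to extend expand bases on each side of the RT stop.

```python
from collections import Counter

def merge_rt_stops(rep1, rep2, threshold, expand):
    """Merge RT stops, given as (chrom, pos, strand), from two replicates.

    A stop is kept only if it occurs in both replicates and its summed
    count is greater than `threshold`. Each kept stop is returned as
    (chrom, start, end, strand, total_count), widened by `expand` bases
    on both sides (symmetric widening is an assumption).
    """
    counts1, counts2 = Counter(rep1), Counter(rep2)
    merged = []
    for key in sorted(set(counts1) & set(counts2)):
        total = counts1[key] + counts2[key]
        if total > threshold:
            chrom, pos, strand = key
            merged.append((chrom, max(0, pos - expand), pos + expand + 1, strand, total))
    return merged
```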
(9) Reads that do not map to the repeat index are then mapped to hg19.
- The mapped reads are processed by samtools, repeat masker, and blacklist filter.
- As with the repeat index, we then merge RT stops.
(10) Expanded reads from RT stop merging are passed to CLIPper, a peak calling algorithm.
- CLIPper returns a bed-like file format with window coordinates, reads counted per window, etc.
- We use these windows to extract "low FDR" reads from the total set of reads passed to CLIPper.
- The windows provide gene names, which we parse and use to annotate the processed reads.
- We then make bedGraph and BigWig files from this complete pool of "low FDR" reads, allowing easy visualization.
(11) Partition "low FDR" reads by gene type.
- We partition the gene names recovered from CLIPper using ENSEMBL annotation of gene name by RNA type.
- Once this is done, we also split the "low FDR" reads recovered from CLIPper by type using the gene name.
- Protein coding and lincRNA genes can contain embedded snoRNAs or miRNAs, which makes this more challenging.
- In turn, we re-generate the initial RT stop and intersect this with two different filters.
- One filter is a snoRNA mask and the other filter is a miR mask.
- *** Both are derived from < ... >. ***
- These masks allow us to remove all "protein coding" RT stops that fall within annotated sno/mi-RNA regions.
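The masking idea can be illustrated with a small filter. This sketch is not the pipeline's method (an interval tool such as bedtools is the more likely choice there): the function name is hypothetical, the mask is treated as strand-agnostic, and intervals use BED coordinates (0-based start, exclusive end).

```python
def filter_masked(stops, mask):
    """Drop RT stops (chrom, pos, strand) that fall inside any mask interval.

    `mask` is a list of (chrom, start, end) tuples, for example the
    snoRNA or miRNA regions. Linear scan; fine for small examples only.
    """
    kept = []
    for chrom, pos, strand in stops:
        inside = any(m_chrom == chrom and m_start <= pos < m_end
                     for m_chrom, m_start, m_end in mask)
        if not inside:
            kept.append((chrom, pos, strand))
    return kept
```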
(12) Quantification of reads per gene.
- For each gene type, we quantify the number of reads per gene.
- For all but snoRNAs, this is computed using the bed files obtained above.
- For snoRNAs, we intersect the initial pool of "low FDR" reads with a custom annotation file.
- Collectively, this gives us reads per gene for each gene type.
- All are based upon ENSEMBL annotation except for the snoRNAs.
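Once each read carries a gene name, the per-gene count is a simple tally. A minimal sketch, assuming reads are available as (read_id, gene_name) pairs; that in-memory representation and the function name are hypothetical, since the pipeline works from bed files.

```python
from collections import Counter

def reads_per_gene(annotated_reads):
    """Count reads per gene from (read_id, gene_name) pairs."""
    return Counter(gene for _read_id, gene in annotated_reads)
```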
(13) Summary of RT stop intensity around CLIPper cluster centers.
- We generate a bed file of cluster center positions using the CLIPper cluster file output.
- We use a custom perl script that generates a heatmap of RT stop intensity per cluster.
- This allows us to later visualize the distribution of RT stops per cluster.
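Deriving a cluster center from a cluster line might look like the sketch below. It assumes the cluster file has the standard leading BED columns (chrom, start, end, name, score, strand) and takes the integer midpoint as the center; both are assumptions, and the function name is hypothetical.

```python
def cluster_center(bed_line):
    """Return (chrom, center, center + 1, name, strand) for one BED line.

    Assumption: tab-separated columns chrom, start, end, name, score,
    strand; the center is the integer midpoint of start and end.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end, name, strand = (
        fields[0], int(fields[1]), int(fields[2]), fields[3], fields[5])
    center = (start + end) // 2
    return (chrom, center, center + 1, name, strand)
```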
(14) Partition protein coding reads by UTR.
- We intersect sno/mi-RNA filtered reads with ENSEMBL-derived UTR coordinates.
- We perform this such that each read assignment is mutually exclusive.
- This only isolates reads that fall within each UTR type.
- Similarly, we use a custom perl script to generate a matrix of read intensity per gene.
- This provides a complete binding profile per gene.
(15) Partition reads by ncRNA binding region.
- For non-coding RNAs, we simply annotate reads with the start and stop position for each ncRNA.
- This allows us to determine the position of each RT stop with respect to the full length of the gene.
(16) Partition repeat-mapped RT stops by region.
- The repeat RNA mapped RT stops are partitioned using the repeat custom index annotation.
- As with the ncRNAs, this is later used for visualization.
(17) Figure 1 visualizes some of the relevant summary data.
- It includes a read count summary per pipeline step. The source data is: PlotData_ReadsPerPipeFile
- It includes a pie chart of UTR binding.
- This uses reads obtained from intersection with ENSEMBL-derived UTR coordinates.
- The source data is: PlotData_ReadsPerGene_*UTR or CDS
- It also includes a bar graph of gene count per RNA type.
- The source data is: PlotData_ReadAndGeneCountsPerGenetype
(18) Figure 2 provides a richer summary of the UTR data.
- The upper panel is an aggregate trace of binding derived from a custom perl script.
- The lower panels provide a heatmap of binding intensity for genes exclusively bound in each UTR or CDS.
- This allows us to isolate genes with exclusive UTR, CDS, or intronic binding.
- The source data is: PlotData_ExclusiveBound_*
(19) Figures 3 and 4 provide coverage histograms of binding across each repeat RNA and rRNA, respectively.
- Source data: PlotData_RepeatRNAHist_*
(20) Figure 5 provides a summary of snoRNA binding data.
- The pie chart provides a summary of reads per snoRNA type.
- These are complemented by histograms of RT stop position within the snoRNA gene body.
(21) Figure 6 provides histograms of RT stop position within gene body for all remaining ncRNA types.
---
Debugging:
(1) Mapping
- Ensure that bowtie2 is in the $PATH and executable.
(2) CLIPPER
- Uses Python 2.7.
(3) Any script in /bin
- The provided BedGraphToBigWig is built for Linux OS.
- This, and related scripts, may be downloaded for other platforms:
http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/
(4) If there is a problem parsing clusters from CLIPper, consider the version used and update the CLIPPERoutNameDelim='_' parameter.
- The older version of CLIPper names results <name>_<clusterNum>_<readPerCluster>
- The newer version uses name.val__<clusterNum>_<readPerCluster>
- Therefore, for the newer version, set CLIPPERoutNameDelim='.'
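To see why the delimiter matters, here is a sketch of parsing both name formats. How the pipeline itself applies CLIPPERoutNameDelim is not shown in this README, so the function below is hypothetical: it takes the text before the first delimiter as the gene name and reads the last two underscore-separated fields as the cluster number and read count. The example names are made up.

```python
def parse_cluster_name(cluster_name, delim="_"):
    """Split a CLIPper cluster name into (gene, cluster_num, reads).

    Assumption: the gene name is everything before the first `delim`,
    and the final two "_"-separated fields are integers.
    """
    gene = cluster_name.split(delim)[0]
    _prefix, cluster_num, reads = cluster_name.rsplit("_", 2)
    return gene, int(cluster_num), int(reads)

# Older format: <name>_<clusterNum>_<readPerCluster>
old = parse_cluster_name("ENSG0001_3_25", delim="_")
# Newer format: name.val__<clusterNum>_<readPerCluster>
new = parse_cluster_name("ENSG0001.2__3_25", delim=".")
```

With the wrong delimiter the newer format yields "ENSG0001.2" as the gene name, which will not match the annotation.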