-
Notifications
You must be signed in to change notification settings - Fork 8
/
todo
119 lines (119 loc) · 6.66 KB
/
todo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
- Do the read/write_motifs() functions work with the gap slots?
- Scan for motif clusters?
- motif_peaks() is in dire need of a re-write
- Check that everything in convert_motifs() still works
- Check that all read/write_*() functions still work
- Add tests
- Visualisation of scan_sequences()/enrich_motifs() --> mention EnrichedHeatmap
- Work on Rmd output of compare_motifs()
- motif_peaks(): compare two sets of sequences (enrichment analysis)
- motif_peaks(): serious code cleanup needed
- dependencies to consider removing:
+ !! yaml (write my own parser, doesn't matter if it's slow)
- Remove some of the input parameter checking in merge_motifs()/view_motifs()/
motif_tree(), rely solely on compare_motifs()
- in read_*() files, check to input file exists
- change internal motif precision limit from 1e-3 to 1e-6 (will require changes
to how allow.nonfinite works)
- add a vignette section about pseudocounts and how the functions interact with
them
- small meme p-values solution: keep the pval/qval/eval slots as log-transformed,
un-log for print and `[` methods?
- de novo motifs:
+ first search 4/5/6/7/8-mers for enrichment differences (either
against background or statistically expected based on 1/2/3-let frequencies)
+ then for significantly enriched k-mers, start changing PWM positions to
see if that increases significance/enrichment OR start merging enriched
k-mers? (fast version: only k-mer merging?)
+ for every enriched k-mer, scan with a logodds threshold of 0 -- look
for enrichment of lower score hits?
+ finally extend motif until desired length
+ trim useless edges
- is the MEME E/p-value log10 or natural log?
- let switch_alph() be used to change to arbitrary alphabets (but keep default
behaviour consistent with previous version!)
- time for another make_DBscores() re-write I think --> use dynamic pvals?
- need a function which can allow for motifs with different background freqs
to be comparable: maybe something like PPM --> PWM (change background) PWM
--> PPM?
- create_motif(): is there a check for the "type" input?
+ why does create_motif(type="asdf") work?
- read_*(): allow to set pseudocount
- make n^k protection in motif_pvalue() optional
- make sure no errors are possible inside c++ code -- can lead to memory leaks
since c++/R may not communicate properly about what needs to be freed
- new view_logo() example idea: show how to plot A->C mutation change (need to
add arrows to fontDFroboto)
- read_matrix(): make it so that it's possible to read all other formats with
it via tweaking parsing options
- go through vignettes again
- use NULL instead of missing for optional args
- get rid of DataFrame and just go pure data.frame? (store metadata in attributes?)
- better manipulation of alphabet slot
- stop this from working? motif["name"] <- NA
- motif_pvalue() vignette bit: explain why exact calculations dont give exact
results (rounding to three digits for use as ints)
- filter_motifs() example in vignette: nsites column returning NAs?
- weird interaction:
+ mydf %>% dplyr::rename(name = altname, altname = name) %>% to_list()
+ if altname was blank, this causes "0" to be filled in the name slot
- scan_sequences() and large datasets: spends a LOT of time in pre- and
post-processing... need to optimize this. The actual scanning is quite fast
even with billions of letters.
- Make use of average_ic() to filter motifs in more function examples? Right
now it just shows up in compare_motifs().
- merge_motifs() 2nd example: also plot submotifs
- make sure window size can't be set too low (shuffle_sequences, get_bkg)
- motif_score(create_motif("ATCGTACGTG"), 0, allow.nonfinite = TRUE) --> broken?
- should I change the default comparison metric back to PCC...? I feel bad for
constantly switching but ALLR is kind of annoying to work with in some cases
- add an extranum slot
- does view_motifs(use.type="ICM") work with KL motifs? (is ylim properly set)
- pretty sure the allow.nonfinite option is borked in motif_pvalue()
- native logo plotting with trees
- motif p-values vignette section: show effects of k using plots instead of tables
- emit a warning whenever the multifreq/gap slots are lost
- motif_tree() + view_motifs():
+ ggtree::inset
+ tree + ggtree::geom_subview(mot_plot_list [+theme_inset()], ...) [see inset code]
+ ggtree::facet_plot
+ ggtree::geom_fruit --> geom_subview for circular layouts
- would it be faster to handle chars from XString directly instead of converting to
integer?
- should def rework shuffle so that it doesn't create unconnected vertices!
- dont think enrichment is taking into account respect.strand in enrich_motifs(),
only looks at RC=TRUE/FALSE
- shuffle_motifs() adds a pseudocount??
- klet --> kmer
- multifreq --> higher order ? (maybe nmer motif)
- work with methylated DNA?
- scan_sequences(motifs[1:100], athaliana, warn.NA=F,threshold=.9,
threshold.type="logodds",nthreads=6) --> crashes!!! (out of memory I think)
- enrich_motifs with size 1 sequences causes crash --> check size of seqs
- stop using multifreq slot; make motif objects represent a single order level
- comparison: add option to use sum to find best alignment, then apply mean/etc
(or p-value? see gupta et al 2007)
- create a separate repo for a universalmotif website; otherwise the vignettes
in the package will become too big/annoying
- scan_sequences() P-values: should they use the bkg of the input sequences instead
of the motif? wouldn't this calculate a better null for calc.qvals.method="fdr"?
- Really need to get straight what background probabilities to use in motif_pvalue()
when scanning; the motifs own, or those of the sequences being scanned? Also need
to make sure that feeding bkg.probs to motif_pvalue() actually uses those in the
create of the PWMs!
- Add an example in vignettes about methylated DNA, referring to
https://www.biorxiv.org/content/10.1101/043794v1
e.g. create_motif(alphabet="ACGTm", bkg = c(A=.25,C=.125,G=.25,T=.25,m=.125))
+ however this means no reverse complement scanning...
- to_df() et al: don't show warning when printing if errors are present?
- change the default pseudocount value to 0.1? (instead of 1)
--> or... change default nsites to 1000?
- fix motif_peaks()
- compare_motifs(): allow overhangs to be compared to background fequencies (see
homer)
- compare_motifs(): dynamic P-values!
- create a separate container for higher order motifs
- scan_sequences(): for large jobs, save temporary results on disk; maybe use the qs pkg?
- combine view_motifs() with EnrichedHeatmap/ComplexHeatmap?
- compare_motifs() w/ normalization: does this still work for scores w/ -ve values?
- scan_sequences(): needs a serious performance overhaul for large jobs