- Motif scanning: yamscan
- Deduplicate overlapping motif hits: yamdedup
- Higher-order sequence shuffling: yamshuf
- Miscellaneous utility scripts: Extra scripts
Requires a C compiler (tested with gcc/clang), GNU Make, and Zlib.
git clone https://github.com/bjmt/yam-toolkit # Or download a recent release
cd yam-toolkit
make
This will create the final binaries in bin/
within the project folder.
I occasionally find myself needing to scan motifs against the Arabidopsis genome from the command line. Usually having the MEME suite installed locally, I resort to using fimo. Inevitably the wait time exceeds my limited patience. Eventually I decided to perform my own re-write of fimo, only this time I would only include features I need in addition to making sure it could run fast enough to satisfy me. This effort led to creating a faster replacement, yamscan, and then later a smattering of additional related programs/utilities as my needs demanded.
I would also like to mention that the fast performance of these programs was made possible by making use of several functions from Dr. Heng Li's amazing klib C library.
A regular DNA/RNA scanner with a focus on simplicity and speed.
yamscan v1.7 Copyright (C) 2023 Benjamin Jean-Marie Tremblay
Usage: yamscan [options] [ -m motifs.txt | -1 CONSENSUS ] -s sequences.fa
-m <str> Filename of text file containing motifs. Acceptable formats: MEME,
JASPAR, HOMER, HOCOMOCO (PCM). Must be 1-50 bases wide.
-1 <str> Instead of -m, scan a single consensus sequence. Ambiguity letters
are allowed. Must be 1-50 bases wide. The -b, -t, -0, -p, and -n
flags are unused.
-s <str> Filename of fast(a|q)-formatted file containing DNA/RNA sequences
to scan. Can be gzipped. Use '-' for stdin. Omitting -s will cause
yamscan to print the parsed motifs instead of scanning.
Alternatively, solely providing -s and not -m/-1 will cause
yamscan to return sequence stats. Non-standard characters (i.e.
other than ACGTU) will be read but are treated as gaps during
scanning.
-x <str> Filename of a BED-formatted file containing ranges within
sequences which scanning will be restricted to. Must have at least
three tab-separated columns. If a fourth column is present it will
be used as the range name. If a sixth strand column is present
scanning will be restricted to the indicated strand. Note that -f
is disabled when -x is used. It is recommended the BED be sorted
for speed. Overlapping ranges are allowed, but be warned that they
will be individually scanned thus potentially introducing
duplicate hits. The file can be gzipped.
-o <str> Filename to output results. By default output goes to stdout.
-b <dbl, Comma-separated background probabilities for A,C,G,T|U. By default
dbl, the background probability values from the motif file (MEME only)
dbl, are used, or a uniform background is assumed. Used in PWM
dbl> generation.
-f Only scan the forward strand.
-t <dbl> Threshold P-value. Default: 0.0001.
-0 Instead of using a threshold, simply report all hits with a score
of zero or greater. Useful for manual filtering.
-p <int> Pseudocount for PWM generation. Default: 1. Must be a positive
integer.
-n <int> Number of motif sites used in PPM->PCM conversion. Default: 1000.
-M Mask lower case letters and do not scan.
-d Deduplicate motif/sequence names. Default: abort. Duplicates will
have the motif/sequence numbers appended. Incompatible with -x.
-r Don't trim motif (HOCOMOCO/JASPAR only, HOMER/MEME must already be
one word) and sequence names to the first word.
-l Deactivate low memory mode. Normally only a single sequence is
stored in memory at a time. Setting this flag allows the program
to instead store the entire input into memory, which can help with
performance in cases of slow disk access or gzipped files. Note
that this flag is automatically set when reading sequences from
stdin, and when multithreading is enabled.
-j <int> Number of threads yamscan can use to scan. Default: 1. Note that
increasing this number will also increase memory usage slightly.
The number of threads is limited by the number of motifs being
scanned.
-g Print a progress bar during scanning. This turns off some of the
messages printed by -w. Note that it's only useful if there is
more than one input motif.
-v Verbose mode.
-w Very verbose mode.
-h Print this help message.
yamscan reports basic information about matches, including coordinates, scores, P-values, percent of the scores from the max, and the actual match. Additional information about the motifs and sequences is included in the header, which can be used for calculating Q-values after the fact. The coordinates are 1-based.
Example output:
##yamscan v1.5 [ -t 0.04 -m test/motif.jaspar -s test/dna.fa ]
##MotifCount=1 MotifSize=5 SeqCount=3 SeqSize=158 GC=45.57% Ns=0 MaxPossibleHits=292
##seq_name start end strand motif pvalue score score_pct match
1 30 34 + 1-motifA 0.0078125 4.874 73.4 CTCGC
1 31 35 - 1-motifA 0.0341796875 2.482 37.4 TCGCG
2 4 8 + 1-motifA 0.0166015625 3.860 58.2 GTCGA
2 16 20 - 1-motifA 0.0078125 4.874 73.4 GCGAG
2 42 46 + 1-motifA 0.01953125 3.725 56.1 GTCTA
2 43 47 - 1-motifA 0.015625 3.867 58.3 TCTAG
3 28 32 + 1-motifA 0.01953125 3.725 56.1 GTCTA
One can also use yamscan to get basic information about motifs and sequences. By only using yamscan with one of these at a time, the following is output:
$ bin/yamscan -m test/motif.jaspar
----------------------------------------
Motif: 1-motifA (N1 L1)
MaxScore=6.64 Threshold=[exceeds max]
Motif PWM:
A C G T
1: -3.74 1.66 -1.12 -1.73
2: -7.23 0.38 -2.81 1.35
3: -1.43 1.34 -2.59 -0.09
4: -3.06 -7.23 1.02 0.88
5: 1.27 -0.49 -0.36 -3.36
Score=-24.14 --> p=1
Score=-12.07 --> p=0.82
Score=0.00 --> p=0.11
Score=3.32 --> p=0.022
Score=6.64 --> p=0.00098
----------------------------------------
$ bin/yamscan -s test/dna.fa
##seq_num seq_name size gc_pct n_count
1 1 55 49.09 0
2 2 70 45.71 0
3 3 33 39.39 0
$ bin/yamscan -s test/dna.fa -x test/dna.bed
##bed_range bed_name seq_num seq_name size gc_pct n_count
1:1-35(+) A 1 1 35 51.43 0
2:11-48(-) B 2 2 38 50.00 0
This mode shows the internal PWM representation of motifs, as well P-values for the min and max possible scores (with some in-between scores). Basic information about sequences is output, including size, GC percent, and the number of non-DNA/RNA letters found. If a BED file is also provided then the sequence stats are restricted to those ranges.
Finally, scanning can be restricted to only parts of the input sequences as specified in a BED file. This will, of course, result in nearly linear speed-ups to the runtime proportional to the fraction of the input sequences being scanned. Example output:
##yamscan v1.5 [ -t 0.04 -m test/motif.jaspar -s test/dna.fa -x test/dna.bed ]
##MotifCount=1 MotifSize=5 BedCount=2 BedSize=73 SeqCount=3 SeqSize=158 GC=45.57% Ns=0
##bed_range bed_name seq_name start end strand motif pvalue score score_pct match
1:1-35(+) A 1 30 34 + 1-motifA 0.0078125 4.874 73.4 CTCGC
2:11-48(-) B 2 16 20 - 1-motifA 0.0078125 4.874 73.4 GCGAG
2:11-48(-) B 2 43 47 - 1-motifA 0.015625 3.867 58.3 TCTAG
The two programs have slightly different defaults, so right out of the box they will not give identical values for scores and P-values. However with some slight tweaking to even things out we can see the outputs are nearly identical.
fimo:
$ fimo --bfile --uniform-- --motif-pseudo 1 --no-qvalue --text --thresh 0.02 test/motif.meme test/dna.fa
Using motif +motif of width 5.
Using motif -motif of width 5.
motif_id motif_alt_id sequence_name start stop strand score p-value q-value matched_sequence
motif 1 30 34 + 4.87379 0.00781 CTCGC
motif 2 4 8 + 3.83495 0.0166 GTCGA
motif 2 16 20 - 4.87379 0.00781 CTCGC
motif 2 42 46 + 3.69903 0.0195 GTCTA
motif 2 43 47 - 3.91262 0.0137 CTAGA
motif 3 28 32 + 3.69903 0.0195 GTCTA
yamscan (also manually setting the nsites value found in the motif file):
$ bin/yamscan -b 0.25,0.25,0.25,0.25 -p 1 -n 175 -s test/dna.fa -t 0.02 -m test/motif.meme
##yamscan v1.5 [ -b 0.25,0.25,0.25,0.25 -p 1 -n 175 -s test/dna.fa -t 0.02 -m test/motif.meme ]
##MotifCount=1 MotifSize=5 SeqCount=3 SeqSize=158 GC=45.57% Ns=0 MaxPossibleHits=292
##seq_name start end strand motif pvalue score score_pct match
1 30 34 + motif 0.0078125 4.877 73.4 CTCGC
2 4 8 + motif 0.0166015625 3.843 57.8 GTCGA
2 16 20 - motif 0.0078125 4.877 73.4 GCGAG
2 42 46 + motif 0.01953125 3.704 55.7 GTCTA
2 43 47 - motif 0.013671875 3.919 59.0 TCTAG
3 28 32 + motif 0.01953125 3.704 55.7 GTCTA
(Actually, you might notice one small difference: yamscan always returns the
sequence on the forward strand for each hit, whereas fimo will return reverse
complement sequences when hits are on the reverse strand. If you prefer the
fimo behaviour you can pipe the yamscan output to scripts/flip_rc.sh
)
Using GNU Time on my MacbookPro M1 and the following equivalent commands to record time elapsed and peak memory usage. fimo is from MEME v5.4.1.
Default yamscan settings (low-mem mode active):
yamscan -v -t 0.0001 -m motifs.txt -s seqs.fa > res.txt
yamscan with multi-threading (and low-mem mode implicitly disabled):
yamscan -j4 -v -t 0.0001 -m motifs.txt -s seqs.fa > res.txt
fimo with Q-values disabled and immediate printing of results:
fimo --verbosity 1 --thresh 0.0001 --text motifs.txt seqs.fa > res.txt
homer using motifs with logodds scores matching P = 0.0001:
scanMotifGenomeWide.pl motifs.txt seqs.fa > res.txt
yamscan |
yamscan -j4 |
fimo |
homer |
|
---|---|---|---|---|
100x1Kbp (100Kbp) + 10 motifs | 0.02s, 4.30MB | 0.02s, 9.91MB | 0.23s, 3.92MB | 0.48s, 6.64MB |
100x1Kbp (100Kbp) + 100 motifs | 0.20s, 5.89MB | 0.07s, 12.31MB | 1.96s, 4.44MB | 1.43s, 6.63MB |
100x10Kbp (1Mbp) + 10 motifs | 0.10s, 4.28MB | 0.06s, 11.44MB | 2.44s, 4.20MB | 1.45s, 6.67MB |
100x10Kbp (1Mbp) + 100 motifs | 0.70s, 6.02MB | 0.23s, 15.80MB | 23.24s, 4.77MB | 10.32s, 6.66MB |
TAIR10 (120Mbp) + 10 motifs | 6.89s, 41.08MB | 2.36s, 153.10MB | 4m41.99s, 4.01MB | 1m55.18s, 34.11MB |
TAIR10 (120Mbp) + 100 motifs | 1m06.29s, 41.59MB | 19.14s, 152.10MB | (not run) | (not run) |
GRCh38 (3.2Gbp) + 10 motifs | 3m04.80s, 249.50MB | 1m02.30s, 3.09GB | (not run) | (not run) |
If you are unfortunate enough to be working with genomes sized in the billions
and find this is not fast enough, then if you are willing to lose out on the
dependency-free convenience of yamscan I recommend trying out
MOODS. This library makes use of several
filtering algorithms to significantly speed up scanning (whereas yamscan
dumbly scores every possible match for all motifs across all sequences). I have
found after some brief testing that when scanning hundreds of motifs across
sequences in the Mbp-Gbp range several-fold speed-ups can be achieved. (In my
limited testing I also noticed that for smaller scanning jobs MOODS can itself
be several times slower due to the it's large startup cost -- an additional
tradeoff to keep in mind.) Alternatively, if CPU time is meaningless to you and
you have access to a large number of cores you can surpass even these
impressive scanning times by making liberal use of yamscan's -j
flag. See
the latest MOODS
paper for
details.
Remove overlapping motif hits (or any type of sequence range).
This program is meant to work with direct yamscan output, and can work with a
piped stdin
. (However it can also work with regular BED files!) This means it
expects sequence/motif/strand combinations to be in ascending order by their
start coordinates. Separate sequence/motif/strand combinations can be
interleaved (which is the case when yamscan is run using multiple threads),
but within each unique combination the entries must be coordinate sorted.
It tries its utmost to use the least amount of memory required to faithfully
remove overlapping hits across the entire file, but for complex inputs it
may still end up using several dozen MBs. As an example, running yamdedup
with a BED file containing ~30,000,000 unique hits duplicated three times
(final range count of ~120,000,000, or <2 GB gzip-compressed) for ~500 motifs
(interleaved) across the Arabidopsis genome runs in about 48 seconds and
reaches 80 MB peak memory usage. Of course this will vary wildly depending on
how much of the input is overlapping; if all ranges in a file are overlapping
in one big chain, then yamdedup will be forced to store the entire input in
memory at a time.
yamdedup v1.0 Copyright (C) 2022 Benjamin Jean-Marie Tremblay
Usage: yamdedup [options] -i [ results.txt | ranges.bed ]
-i <str> Filename of yamscan results file or a tab-delimited BED file
with at least six columns: (1) sequence name, (2) 0-based start,
(3) end, (4) motif name, (5) motif score, and (6) the strand.
If a yamscan results file is provided the P-value is used for
deciding which overlapping hit(s) to discard, otherwise for BED
the score column is used and lower scores are discarded. The input
is assumed to be partially sorted: different sequence/motif/strand
combinations can be interleaved, but the individual combinations
themselves must be sorted by their start coordinates. The
yamscan program outputs its results this way, so no additional
sorting is needed. Can be gzipped. Use '-' for stdin.
-o <str> Filename to output results. By default output goes to stdout.
-s Ignore strand when finding overlapping ranges.
-m Ignore motif name when finding overlapping ranges.
-0 Ignore scores when removing overlapping ranges, causing yamdedup
to simply remove overlapping ranges in the order they appear.
-S Sort on range size instead of score/p-value (keeping larger ones).
-r Sort in opposite order (i.e., keep lower scores, higher p-values,
or smaller ranges).
-y Force yamdedup to treat the input as a yamscan output file.
-b Force yamdedup to treat the input as a BED file.
-v Verbose mode.
-w Very verbose mode.
-h Print this help message.
Let us consider the following basic scenario:
$ bin/yamscan -t 0.2 -m test/motif.jaspar -s test/dna.fa | head -n6
##yamscan v1.5 [ -t 0.2 -m test/motif.jaspar -s test/dna.fa ]
##MotifCount=1 MotifSize=5 SeqCount=3 SeqSize=158 GC=45.57% Ns=0 MaxPossibleHits=292
##seq_name start end strand motif pvalue score score_pct match
1 3 7 + 1-motifA 0.060546875 1.459 22.0 GCTGA
1 7 11 + 1-motifA 0.1640625 -1.165 -17.6 ACTGA
1 11 15 + 1-motifA 0.0654296875 1.236 18.6 ATCGA
In this example we can see that all three hits overlap, but the middle hit has a much worse score. yamdedup will recognize this and simply remove only that hit:
$ bin/yamscan -t 0.2 -m test/motif.jaspar -s test/dna.fa | head -n6 | bin/yamdedup -i-
##yamscan v1.5 [ -t 0.2 -m test/motif.jaspar -s test/dna.fa ]
##MotifCount=1 MotifSize=5 SeqCount=3 SeqSize=158 GC=45.57% Ns=0 MaxPossibleHits=292
##seq_name start end strand motif pvalue score score_pct match
1 3 7 + 1-motifA 0.060546875 1.459 22.0 GCTGA
1 11 15 + 1-motifA 0.0654296875 1.236 18.6 ATCGA
Again, yamdedup won't mind if the input is a properly formatted BED file:
$ bin/yamscan -t 0.2 -m test/motif.jaspar -s test/dna.fa | head -n6 | scripts/to_bed.sh | bin/yamdedup -i-
1 2 7 1-motifA 12 + 1.459 22.0 0.060546875 .
1 10 15 1-motifA 11 + 1.236 18.6 0.0654296875 .
There are a few options available for controlling the behaviour of yamdedup.
For example, the default behaviour for BED files is to prioritize keeping
higher scores, but this can be reversed using -r
:
$ bin/yamscan -t 0.2 -m test/motif.jaspar -s test/dna.fa | head -n6 | scripts/to_bed.sh | bin/yamdedup -i- -r
1 6 11 1-motifA 7 + -1.165 -17.6 0.1640625 .
Now we can see that the middle lower-scoring hit was kept instead.
Other options include ignoring the strand of motif hits (-s
) if you wish to
consider hits on opposite strands to be seen as overlapping, or even ignore the
motif names (-m
) themselves to remove any overlapping range period.
Overlapping hits can be removed in the order they appear, irrespective of their
hit score, using -0
, or even consider the size of the ranges (-S
) instead
of the scores (perhaps more useful for non-motif BED inputs). (In the future I
may consider adding additional options such as requiring specific amounts of
overlap.)
A regular DNA/RNA sequence shuffler with a focus on simplicity and speed.
yamshuf v1.3 Copyright (C) 2023 Benjamin Jean-Marie Tremblay
Usage: yamshuf [options] -i sequences.fa
-i <str> Filename of fast(a|q)-formatted file containing DNA/RNA sequences
to scan. Can be gzipped. Use '-' for stdin. Non-standard
characters (i.e. other than ACGTU) will be read but are treated as
the letter N during shuffling (exceptions: when -l is used or when
-k is set to 1). Fastq files will be output as fasta.
-k <int> Size of shuffled k-mers. Default: 3. When k = 1 a Fisher-Yates
shuffle is performed. Max k for Euler/Markov methods: 9.
-o <str> Filename to output results. By default output goes to stdout.
-s <int> Seed to initialize random number generator. Default: 4.
-m Use Markov shuffling instead of performing a random Eulerian walk.
Essentially generates random sequences with similar k-mer
compositions. Generally requires large sequences to be effective.
-l Split up the sequences linearly into k-mers and do a Fisher-Yates
shuffle instead of performing a random Eulerian walk. Very fast.
-r <int> Repeat shuffling for each sequence any number of times. The repeat
number will be appended to the sequence name. Default: 0.
-R Reset the random number generator every time a new sequence is
shuffled using the set seed instead of only setting it once.
-n Output sequence as RNA. By default the sequence is output as DNA,
even if the input is RNA. This flag only applies when k > 1 and -l
is not used, since in such cases the existing sequence letters are
simply being rearranged.
-p Activate an alternate mode which prints k-mer counts instead of
shuffling. All options excepting -i, -k and -o are ignored.
-v Verbose mode.
-w Very verbose mode.
-h Print this help message.
yamshuf uses a few different shuffling implementations. When k = 1, it performs
a Fisher-Yates shuffle of the input sequences. For higher values of k, there
are three available methods. The default method (Euler) involves constructing
a k-mer edge graph from the k-mer counts of the input sequences and finding a
random Eulerian walk through all of the available k-mers (thus ending up with
the exact same number of k-mers in the final shuffled sequences). The -m
flag triggers the use of the Markov method, where a Markov chain is constructed
from the k-mer counts in the sequences and the resulting probabilities are used
for generating new sequences of equal length (thus ending up with new sequences
with similar, but not exactly the same, k-mer counts). Finally, the -l
flag
results in the input sequences being split linearly into chunks (with length k)
which are then shuffled via the Fisher-Yates method.
Using GNU Time on my MacbookPro M1 and the following equivalent commands to record time elapsed and peak memory usage. A fasta file containing 10 sequences (each 10 Mbp long) with the alphabet ACGTN is used as input.
Default yamshuf settings (Euler shuffling, or regular Fisher-Yates for k = 1):
yamshuf -k $K -i 100Mbp.fa > shuffled.fa
yamshuf with Markov shuffling:
yamshuf -m -k $K -i 100Mbp.fa > shuffled.fa
yamshuf with linear shuffling:
yamshuf -l -k $K -i 100Mbp.fa > shuffled.fa
MEME v5.4.1 fasta-shuffle-letters
tool (using the uShuffle library):
fasta-shuffle-letters -k $K 100Mbp.fa shuffled.fa
yamshuf |
yamshuf -m |
yamshuf -l |
fasta-shuffle-letters |
|
---|---|---|---|---|
10x10Mbp, k = 1 | 0.40s, 19.46MB | (n/a) | (n/a) | 0.63s, 52.82MB |
10x10Mbp, k = 2 | 1.80s, 19.49MB | 1.66s, 19.49MB | 0.36s, 19.52MB | 2.62s, 416.46MB |
10x10Mbp, k = 3 | 1.89s, 19.42MB | 1.70s, 19.42MB | 0.32s, 19.42MB | 3.40s, 416.50MB |
10x10Mbp, k = 4 | 1.93s, 19.49MB | 1.71s, 19.42MB | 0.30s, 19.42MB | 4.36s, 416.50MB |
10x10Mbp, k = 5 | 2.01s, 19.46MB | 1.87s, 19.47MB | 0.29s, 19.42MB | 6.86s, 425.68MB |
10x10Mbp, k = 6 | 2.06s, 19.66MB | 1.77s, 19.55MB | 0.34s, 19.42MB | 13.42s, 416.98MB |
10x10Mbp, k = 7 | 2.53s, 20.35MB | 2.25s, 20.05MB | 0.30s, 19.42MB | 33.55s, 418.54MB |
10x10Mbp, k = 8 | 2.63s, 23.87MB | 2.32s, 22.54MB | 0.28s, 19.46MB | (not run) |
10x10Mbp, k = 9 | 5.51s, 41.26MB | 4.56s, 34.72MB | 0.30s, 19.42MB | (not run) |
For some reason the fasta-shuffle-letters
program is very memory hungry, much
more so than the original uShuffle program. However the standalone uShuffle
program did not include a fasta reader, meaning it could only take sequences
as command line arguments, severely limiting the maximum sequence size (and
thus I cannot benchmark it with the above sequence sizes).
yamshuf is a rather limited program in that it only recognizes a five letter alphabet (ACGTN or ACGUN; all other characters are recognized as N). This allows the program to use hard-coded constants and graph structures. This is in contrast to other tools such as uShuffle (and my own programs universalmotif and sequence-utils) which are generically coded to allow for any alphabet, but in turn (likely) allow for fewer compiler optimizations. Additionally, yamshuf constructs a k-mer edge graph for the complete set of possible k-mers for any k, even if some k-mers are absent in the input sequences. This means that for increasing values of k, the graph structure increases exponentially. As a result shuffling with high values of k is impractical. uShuffle gets around this by building a k-mer edge graph using only available k-mers, thus allowing for much higher values of k when shuffling short sequences (for longer sequences, so many k-mers will likely be present that it becomes affected by the same issue of impractically large edge graphs).
If there is not a requirement that the shuffled sequences have the exact same
k-mer counts as the input sequences, then none of these limitations apply when
using the linear method (-l
). This is because yamshuf merely moves around
chunks of the input sequences without needing to count k-mers or build any
edge graph. (In fact, the higher the value of k, the fewer chunks there are
to move around, thus increasing the speed of the shuffling.)
A few extra utilities are included in the scripts/
folder. These take the
yamscan results via stdin
and output their results to stdout
.
add_qvals.sh
: Calculate Benjamini-Hochberg adjusted P-values (or Q-values) and add them as a tenth column. (Note: Do not deduplicate/remove overlapping hits before calculating Q-values.) Currently not compatible with yamscan run using the-x
flag. (Generally I find it doesn't make much sense to calculate Q-values when scanning for motifs.) Be warned that this can be quite slow for big inputs, since it has to sort everything twice.dedup_hits.sh
: Remove lower-scoring overlapping hits (of the same motif). Currently not compatible with yamscan run using the-x
flag. This should only be used for very small inputs (eg <100,000 rows) unless you are willing to wait a while. Note: As of yamscan v1.4 this scripts has been deprecated in favour of the yamdedup program.sort_coord.sh
: Sort the results by coordinate.sort_motif.sh
: Sort the results by motif name.sort_pval.sh
: Sort the results by P-value.
Extra arguments can be provided by setting a SORT_ARGS
environmental variable
in the shell. For example:
$ SORT_ARGS="--buffer-size=1G" scripts/add_qvals.sh < results.txt
flip_rc.sh
: Reverse complement sequence matches on the reverse strand. This can be useful if you wish to see matches from the strand the motif was matched from, as the default is to always return the sequence from the forward strand.std_kmers.sh
: Filter the output ofyamshuf -p
to only include k-mers containing standard letters (ACGT or ACGU).to_bed.sh
: Convert the results to a BED6+4 format.to_gff3.sh
: Convert the results to GFF3.to_gtf.sh
: Convert the results to GTF/GFF2.
See the scripts for a description of the output formats.
In this example, the output of yamscan is first piped to flip_rc.sh
to
reverse complement the reverse strand motif matches, then to add_qvals.sh
to calculate Q-values, to dedup_hits.sh
to remove overlapping hits,
coordinate sorted with sort_coord.sh
, and finally converted to BED with
to_bed.sh
:
bin/yamscan -t 0.05 -m test/motif.homer -s test/dna.fa \
| scripts/flip_rc.sh \
| scripts/add_qvals.sh \
| scripts/dedup_hits.sh \
| scripts/sort_coord.sh \
| scripts/to_bed.sh \
> res.clean.txt
The format will be auto-detected by yamscan. A brief overview of the requirements for each format follows.
Example JASPAR motif:
>1-motifA
A [ 3 0 16 5 106 ]
C [ 139 57 111 0 31 ]
G [ 20 6 7 89 34 ]
T [ 13 112 41 81 4 ]
JASPAR motifs each have a header line starting with >
followed by the motif
name. Counts for each DNA letter at each position follow this line. Each row
starts the character for each DNA letter, and is followed by counts for each
position in the motif enclosed by square brackets. These counts must be
integers.
Example HOMER motif:
>CYCKA 1-motifA 5.30478607528482
0.015 0.798 0.113 0.074
0.000 0.325 0.033 0.641
0.094 0.630 0.040 0.236
0.028 0.000 0.510 0.463
0.607 0.177 0.192 0.024
HOMER motif header lines have (at least) three tab-separated essential elements,
in addition to needing to begin with the >
character:
-
Consensus sequence (yamscan ignores this but requires something be here)
-
One-word name
-
Minimum logodds score (yamscan ignores this but will warn if missing and running with
-v
/-w
)
This header line is immediately followed by the motif probability values (first column for A, second for C, third for G, and fourth for T/U).
Example MEME motif:
MEME version 5
ALPHABET= ACGT
strands: + -
Background letter frequencies
A 0.249493 C 0.257606 G 0.249493 T 0.243408
MOTIF 1-motifA
letter-probability matrix: alength= 4 w= 5 nsites= 175 E= 0
0.015363 0.797841 0.113101 0.073695
0.000245 0.325285 0.033308 0.641162
0.093911 0.629765 0.039881 0.236444
0.027719 0.000057 0.509626 0.462598
0.606981 0.176988 0.191848 0.024182
There are three essential elements to a MEME motif file:
-
MEME version information
-
For each motif, a line that starts with
MOTIF
and contains the one-word motif name (extra text is ignored) -
For each motif, a line that starts with
letter-probability matrix
and is immediately followed by the motif probability values (first column for A, second for C, third for G, fourth for T/U)
Other optional lines:
-
The
ALPHABET
line is checked by yamscan; if using a custom alphabet definition then it is ignored (it is up to the user to ensure the custom alphabet represents DNA/RNA) -
The
strands:
line is briefly examined, and if it does not match the yamscan settings for which strands to scan a warning will be emitted if using-v
/-w
-
Background values will be used if a line starting with
Background letter frequencies
and immediately followed by probabilities for A,C,G,T/U is found
Both full and minimal MEME motif files can be used (alongside its derivatives DREME and STREME).
Example HOCOMOCO motif:
>1-motifA
144.000000003 180.000000003 79.000000001 97.000000001
179.000000003 57.000000001 181.000000003 83.000000001
176.000000003 53.000000001 255.000000003 16.0000000004
9.0000000002 3.0000000001 16.0000000004 472.000000006
10.0000000002 5.0000000001 476.000000006 9.0000000002
473.000000006 5.0000000001 8.0000000002 14.0000000002
16.0000000004 415.000000006 27.0000000004 42.000000001
12.0000000002 8.0000000002 438.000000006 42.000000001
39.000000001 159.000000003 8.0000000002 294.000000006
144.000000003 233.000000003 72.000000001 51.000000001
268.000000006 57.000000001 93.000000001 82.000000001
HOCOMOCO motifs have a header line starting with >
followed by the motif
name. Only mononucleotide count matrices (PCM) can be used. The counts are
split into four columns (A,C,G,T/U). These counts need not be integers. The
headers cannot contain the tab character.