Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inversion detect problem #100

Open
shiyi-pan opened this issue Oct 1, 2021 · 18 comments
Open

inversion detect problem #100

shiyi-pan opened this issue Oct 1, 2021 · 18 comments

Comments

@shiyi-pan
Copy link

Hi, I want to use syri to detect SVs and I do some experiment to test the accuracy of the Software . When test INS and DEL , it works wonderful ,but when test inversion , I some problems , maybe here is a bug in it.here is what I did.
First I generated an genome include some inversions based on reference genome.
then I used minimap to map the genome include some inversions to reference genome and used syri to detect inversions. The output VCF file did have INVs.
I checkd one inversion from my fasta file and it indeed existed . Here is part of fasta files :
ref:

Chr01:17237000-17239000
TTCAAGGTCTAACAGCCACTGTTTACAACCTTCTCACCTAACCACTACCCGTGCGACCTCTACCTAAGAGCCACTCTTAGATATGAGAACCCCTCTCACTCCCTCTCAATCACACTCCCGTGTTTACAATCAAATCAAAAACACACCAGAGATTGCTCTCTGAACAAAAGAGATCAACTCTACACACTAGAGATCAACTCTACACACAAGAGATCAACTCTACACACAAGAGATCAACTCTACACACTCAGGTCCAACACTTGATGTTAGGGTACCATCAAGGTGGCTCACAAAAGACTCAAGTCCCAAAACTCACAAAATAACTCTTCAATCCCGGACTTGGTGCAGAACTCGTGCAGCCTTCATGTTTATATAGCAGTGTGCGTATCTGGGCTGCAACTTCTTGATGCTGGATGAGATCTATCATTTTCCTGAAAATCTGCACTTAAAGATCTAAAAGATACAGTTTGATCTTTTAGTTTTTATCTTTAATCTTTAATCCCTGAACGAACTATTCAAGTTTGTAATTCGAACTTTAATTATCTTTTAATTCGTTCCAAGAGATAGATCGTCTAATCTGTTGCTAACTGCACAATAATCTGTTAAAGAGATAACAGATTTATGTGTCCAGTATTTTCGGGCCGGATGTCAGGACATCGTGGATCCTGCACACTCTGTTAAAGATATAACAGATTTATGTGTCCAGTATTTTCGAGCCGGATGTCAGGACATCGTATCCGACATCGTGGATCCTGCAGCTTCAATTCTTCATTTGACATTTTATCTTGCCTTGTGCATTGTGCAGCCCGATCTGATTCCTTGACATAACGTTGGACATCATGTGCAGCAACTCCAGCTTTCCTTCATTGTCTAAGTGCTTATGTTTTAACAAAATCTTAGCCAATCTTTTAAACACTCAGTAGAGCTAAGCACTAACAAAATGTTTTATGAATGACCATCAAAGGTGACTTGGAAACACGAACTTAAAGGGAGTTTCATTGCCCAAAAACTTTTTATCCTTTCAAAAGATTAAGAGTTTTTTCTGAACTGAACTGTCTTATCCTCTCAAAAAGATTCTTTGGTCAACCACTTGCATATTCAATAAGGAATTTTGATTGACCTTCATTGTACAATCTATCTCTTTTAAGAGAGATTTCTTCTTCTCTTCTTCTTATTTCTGAAAAGGGATTAAGAGACCGTGGGTTTCTTGTTGTAGAGGATTCTTGAACACAAGGGAAGGGTTGTCCCTGTGTGGTTCAGACTTTGTTAAAAGAGTTTTACAAAGAGGATGGAAAATTTCAAGTGGGTTGCTTGAGGACTGGACGTAGGCACGGGAAGTGGCCGAACCAGTATAAATCAAGTTTGCATTCCTCTCTTCCCTTAAACTTCTTTTATTTATAGCTATTTATCTTTTGCTTTAAAGAAGTTTATTTTGAATTATCTTTTGAGTAATTCATGTTAAGGGTGCATTGTTAATCCGAAGAGAGTGAAAGTTTAATTGGGGAATAGTCTTCGTATCTTAATTCCACCCCCCCTTTTTTTTCTTAAGGTAACTTAGGCCATTTATCCAACATCCTATTCTTGATAACTCACTTCTCTCTAAAAAGACAAACTTTCCGGAATGATGAAATGAGGACACATGAACGTCTTGAAAACACAGTCAATCAAATGCTTTTTTTTTCTTTTTGAACTCTTTTTTTTTATTTTTTTATTTTGAAAATTATTTGTTTTGAACTTTACTCGTTGTTTTACGGCACCCCCACCAACGTGCAAGACGAGTAATCTCTGATTGAACAGTCTTGGAAGTCCAAACTCAGGAGCACAGGTCGCTTGAGCAAACAAACCAATGGCTTGCACCCACATCCCAGTGGAAGCAAAGATGTAATTACGAGAGGATGAGGGACAAAGATGTCAAATTTATCCATCTATTTAGCGTTGTAACTGTGGTTTACAATAATGGTATAAACTTGAAAATCCTGATGAGTCATTAGAGACATCTA

inversion:

Chr01:17237000-17239000
TTCAAGGTCTAACAGCCACTGTTTACAACCTTCTCACCTAACCACTACCCGTGCGACCTCTACCTAAGAGCCACTCTTAGATATGAGAACCCCTCTCACTCCCTCTCAATCACACTCCCGTGTTTACAATCAAATCAAAAACACACCAGAGATTGCTCTCTGAACAAAAGAGATCAACTCTACACACTAGAGATCAACTCTACACACAAGAGATCAACTCTACACACAAGAGATCAACTCTACACACTCAGGTCCAACACTTGATGTTAGGGTACCATCAAGGTGGCTCACAAAAGACTCAAGTCCCAAAACTCACAAAATAACTCTTCAATCCCGGACTTGGTGCAGAACTCGTGCAGCCTTCATGTTTATATAGCAGTGTGCGTATCTGGGCTGCAACTTCTTGATGCTGGATGAGATCTATCATTTTCCTGAAAATCTGCACTTAAAGATCTAAAAGATACAGTTTGATCTTTTAGTTTTTATCTTTAATCTTTAATCCCTGAACGAACTATTCAAGTTTGTAATTCGAACTTTAATTATCTTTTAATTCGTTCCAAGAGATAGATCGTCTAATCTGTTGCTAACTGCACAATAATCTGTTAAAGAGATAACAGATTTATGTGTCCAGTATTTTCGGGCCGGATGTCAGGACATCGTGGATCCTGCACACTCTGTTAAAGATATAACAGATTTATGTGTCCAGTATTTTCGAGCCGGATGTCAGGACATCGTATCCGACATCGTGGATCCTGCAGCTTCAATTCTTCATTTGACATTTTATCTTGCCTTGTGCATTGTGCAGCCCGATCTGATTCCTTGACATAACGTTGGACATCATGTGCAGCAACTCCAGCTTTCCTTCATTGTCTAAGTGCTTATGTTTTAACAAAATCTTAGCCAATCTTTTAAACACTCAGTAGAGCTAAGCACTAAGAAGGGCACGGATGCAGGTCAGGAGTTCGTTGGGTGAACTTTAAAAGGTAGGAGAAACATTTTGAGAAAATTGTTTCAGACTTGGTGTGTCCCTGTTGGGAAGGGAACACAAGTTCTTAGGAGATGTTGTTCTTTGGGTGCCAGAGAATTAGGGAAAAGTCTTTATTCTTCTTCTCTTCTTCTTTAGAGAGAATTTTCTCTATCTAACATGTTACTTCCAGTTAGTTTTAAGGAATAACTTATACGTTCACCAACTGGTTTCTTAGAAAAACTCTCCTATTCTGTCAAGTCAAGTCTTTTTTGAGAATTAGAAAACTTTCCTATTTTTCAAAAACCCGTTACTTTGAGGGAAATTCAAGCACAAAGGTTCAGTGGAAACTACCAGTAAGTATTTTGTAAAACTGGCCGAACCAGTATAAATCAAGTTTGCATTCCTCTCTTCCCTTAAACTTCTTTTATTTATAGCTATTTATCTTTTGCTTTAAAGAAGTTTATTTTGAATTATCTTTTGAGTAATTCATGTTAAGGGTGCATTGTTAATCCGAAGAGAGTGAAAGTTTAATTGGGGAATAGTCTTCGTATCTTAATTCCACCCCCCCTTTTTTTTCTTAAGGTAACTTAGGCCATTTATCCAACATCCTATTCTTGATAACTCACTTCTCTCTAAAAAGACAAACTTTCCGGAATGATGAAATGAGGACACATGAACGTCTTGAAAACACAGTCAATCAAATGCTTTTTTTTTCTTTTTGAACTCTTTTTTTTTATTTTTTTATTTTGAAAATTATTTGTTTTGAACTTTACTCGTTGTTTTACGGCACCCCCACCAACGTGCAAGACGAGTAATCTCTGATTGAACAGTCTTGGAAGTCCAAACTCAGGAGCACAGGTCGCTTGAGCAAACAAACCAATGGCTTGCACCCACATCCCAGTGGAAGCAAAGATGTAATTACGAGAGGATGAGGGACAAAGATGTCAAATTTATCCATCTATTTAGCGTTGTAACTGTGGTTTACAATAATGGTATAAACTTGAAAATCCTGATGAGTCATTAGAGACATCTA

then I check the sam file in IGV, it seems didn't mapped exactly . here is the picture I export from IGV.
igv_snapshot
could you help my fix it ? Think you very much.

@mnshgl0110
Copy link
Member

I am not sure I get the issue. What I understand is that the the alignments (SAM) file do not have the inversion, but the SyRI output (VCF) has it. Is this the case?

If the inversion (inverted alignments) are not in the SAM file, then SyRI cannot find them. So, I am a bit confused about what is happening here. Are you sure you are checking all alignments in IGV? Maybe, check the SAM file directly for the presence of the inverted alignments in this region.

@shiyi-pan
Copy link
Author

Thank you for reply me . The problem is , the alignments (SAM) file do not have the inversion and the SyRI output (VCF) has not detected it too. I generated 900 inversions , but none of them detected by minimap2 and syri pipeline . I check all alignment in IGV. I check the inversion and the flanking sequences. the flanking sequences mapped well and the inversion where I marked in red line didn't align to the reference genome. Could you help me fix it? Thank you again.
Inkedigv_snapshot_LI

@mnshgl0110
Copy link
Member

This seems to be an issue with the minimap2. Check here. Are you using minimap2 version 2.20 or newer? Because those are what causing the issue.

You can ask the minimap2 repo for solutions, maybe there is some way to increase the sensitivity of the aligner so that these inverted regions align properly. Or you can try using nucmer.

@shiyi-pan
Copy link
Author

shiyi-pan commented Oct 1, 2021

Thank you a lot. I used minimap2 version 2.13. I first try using nucmer, but it takes too much memory, so my job always is killed, is there anaway to reduce the memory ?

@shiyi-pan
Copy link
Author

Hi, I have a new discovery,it seems that the inversion I generate is defined as HDRs.
here part of my inversion positions:

chrosome start length
Chr01 1841575 2073
Chr01 3025197 6255
Chr01 3311669 5753
Chr01 4690040 8605
Chr01 5474047 3872

here is part of syri vcf file:

#CHROM POS ID REF ALT QUAL FILTER INFO
Chr01 1 SYNAL1 N . PASS END=1841575;ChrB=Chr01;StartB=1;EndB=1841575;Parent=SYN1;VarType=.;DupType=.
Chr01 1 SYN1 N . PASS END=59720365;ChrB=Chr01;StartB=1;EndB=59720365;Parent=.;VarType=SR;DupType=-
Chr01 1841576 HDR1 N . PASS END=1843648;ChrB=Chr01;StartB=1841576;EndB=1843648;Parent=SYN1;VarType=ShV;DupType=.
Chr01 1843649 SYNAL2 N . PASS END=3025198;ChrB=Chr01;StartB=1843649;EndB=3025198;Parent=SYN1;VarType=.;DupType=.
Chr01 3025199 HDR2 N . PASS END=3031451;ChrB=Chr01;StartB=3025199;EndB=3031451;Parent=SYN1;VarType=ShV;DupType=.
Chr01 3031452 SYNAL3 N . PASS END=3311674;ChrB=Chr01;StartB=3031452;EndB=3311674;Parent=SYN1;VarType=.;DupType=.
Chr01 3311675 HDR3 N . PASS END=3317417;ChrB=Chr01;StartB=3311675;EndB=3317417;Parent=SYN1;VarType=ShV;DupType=.
Chr01 3317418 SYNAL4 N . PASS END=4690040;ChrB=Chr01;StartB=3317418;EndB=4690040;Parent=SYN1;VarType=.;DupType=.
Chr01 4690041 HDR4 N . PASS END=4698645;ChrB=Chr01;StartB=4690041;EndB=4698645;Parent=SYN1;VarType=ShV;DupType=.
Chr01 4698646 SYNAL5 N . PASS END=5474047;ChrB=Chr01;StartB=4698646;EndB=5474047;Parent=SYN1;VarType=.;DupType=.
Chr01 5474048 HDR5 N . PASS END=5477919;ChrB=Chr01;StartB=5474048;EndB=5477919;Parent=SYN1;VarType=ShV;DupType=.
Chr01 5477920 SYNAL6 N . PASS END=5967067;ChrB=Chr01;StartB=5477920;EndB=5967067;Parent=SYN1;VarType=.;DupType=.
Chr01 5967068 HDR6 N . PASS END=5968772;ChrB=Chr01;StartB=5967068;EndB=5968772;Parent=SYN1;VarType=ShV;DupType=.
Chr01 5968773 SYNAL7 N . PASS END=6453372;ChrB=Chr01;StartB=5968773;EndB=6453372;Parent=SYN1;VarType=.;DupType=.

When I generate the inversion, I just turn the corresponding sequence reverse using python list list[::-1]. Did I misunderstand the inversion? What's the meaning of HDR and How to discriminant in syri? I'm very confused , please help me clear them. Thank you very much.

@mnshgl0110
Copy link
Member

Hi. Actually, "Inversions" do not correspond to reversed sequence, rather they are reverse complement. For example:

Seqeunce:  ATGCATGC
Reverse:   CGTACGTA
Inversion: GCATGCAT

The mutated sequences that you have created are effectively not inversions of the original sequence. You can check here how to reverse complement sequence in python

@shiyi-pan
Copy link
Author

Thank you for your answer, it helps me a lot . When I correct my code , the 82% of created inversions could be detected by syri, indicating It is an awesome tool to detect SVs , but most of the undetected inversions were still defined as HDRs. Is any method to find the lost inversions ?
thank you again.

@mnshgl0110
Copy link
Member

Did you check the alignments? Are there inverted alignments overlapping these inversions?

@shiyi-pan
Copy link
Author

Thank you for your reply. I check three inversions defined as HDRs. It seems there is no alignments. Here is the igv_snapshot.
By the way, all the INV I generate is also defined as INVAL.
igv_snapshot.zip

Chr01 1712037 INV850 N . PASS END=1714564;ChrB=Chr01;StartB=1712037;EndB=1714564;Parent=.;VarType=SR;DupType=-
Chr01 1712037 INVAL1192 N . PASS END=1714564;ChrB=Chr01;StartB=1714564;EndB=1712037;Parent=INV850;VarType=.;DupType=.

What's the difference between INV and INVAL , Why the same seq have two types ?

@mnshgl0110
Copy link
Member

Using the latest version of minimap2 repo (HEAD) should solve this issue. Check this lh3/minimap2#816

Regarding INV vs INVAL, check this

@shiyi-pan
Copy link
Author

Thank you for your reply.
I use minimap2 2.22, it seems don't work, the worse is, it generate more HDRs(647) than before(337).
Here is the igv_snapshot:
HDR3-minimap2_v2.22.zip

@mnshgl0110
Copy link
Member

This would be the latest version 2.22-r1110-dirty. Did you get worse results with this?

@shiyi-pan
Copy link
Author

shiyi-pan commented Oct 13, 2021

Thank you for your reply. I don't know the meaning of "dirty", but I download the version 2.22-r1110 and get worse results.
By the way, I check the distribution of the inversions size defined as HDRs. I seems when it less than 5kb, 50% and 70% of the inversions will be defined as HDRs.
when I check translocations, it looks same as inversions . when it less than 5kb, a lot of translocations will be defined as HDRs.
I know this is not the problem of the syri , the key point is Alignment, but I wish you could give me some advise. Thank you again.

@mnshgl0110
Copy link
Member

I tested minimap2 for smaller rearrangements and it was working fine. So, I am a bit surprised why it is not working for your data. Maybe, you share the example sequence with the minimap2 community.
Nevertheless, you can try using nucmer to get alignments. Use the nucmer3 (old version). It is slower than nucmer4, but is much more memory efficient and is more accurate.

@shiyi-pan
Copy link
Author

Thank you for your reply. Here is one chrosome of my data. Would it be convenient for you to check it for me?
Here is my script:

MINIMAP=/ds3512/home/panyp/ruanjian/minimap2/2.13
PATH_TO_SYRI=/ds3512/home/panyp/NN1138-2/05.sv_detection/04.syri_data/syri/syri/bin/syri
DIR=/ds3512/home/panyp/NN1138-2/data/pan_genome

$MINIMAP/minimap2 -t 4 -ax asm5 --eqx Reference.fa TranslocationReference.fa > Reference2Translocation.sam
python3 $PATH_TO_SYRI -c Reference2Translocation.sam -r Reference.fa -q TranslocationReference.fa -k -F S --prefix Reference2Translocation

The ReferenceChr01.fa is the reference genome ,and I create TranslocationChr01.fa from it by generate 50 Chromosome internal translocation. The TranslocationChr01.txt file recode the translocation information . There are seven columns, representing the chrosome, startA, endA, sequenceA, startB, endB, sequenceB of translocation. The Reference2TranslocationsChr01.vcf is the result , you can check the HDRs.
TranslocationChr01.zip

ReferenceChr01.zip

TranslocationChr01.txt
Reference2TranslocationsChr01.vcf.zip

@shiyi-pan
Copy link
Author

Here is the script and result of using minimap2 2.22:

MINIMAP=/ds3512/home/panyp/NN1138-2/05.sv_detection/05.SV_Accuracy/INV_Accuracy/minimap2-2.22
PATH_TO_SYRI=/ds3512/home/panyp/NN1138-2/05.sv_detection/04.syri_data/syri/syri/bin/syri
DIR=/ds3512/home/panyp/NN1138-2/data/pan_genome

$MINIMAP/minimap2 -t 4 -ax asm5 --eqx ReferenceChr01.fa TranslocationChr01.fa > Reference2TranslocationChr01.sam
python3 $PATH_TO_SYRI -c Reference2TranslocationChr01.sam -r ReferenceChr01.fa -q TranslocationChr01.fa -k -F S --prefix Reference2TranslocationChr01

Reference2TranslocationChr01syri.vcf.zip

By the way, my log file have some RuntimeWarning, does it matter?

[M::mm_idx_gen::2.2300.81] collected minimizers
[M::mm_idx_gen::2.442
1.07] sorted minimizers
[M::main::2.4421.07] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::2.515
1.07] mid_occ = 142
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::2.5721.07] distinct minimizers: 4147356 (90.94% are singletons); average occurrences: 1.450; average spacing: 9.928; total length: 59720365
[M::worker_pipeline::131.796
1.00] mapped 1 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: /ds3512/home/panyp/NN1138-2/05.sv_detection/05.SV_Accuracy/INV_Accuracy/minimap2-2.22/minimap2 -t 4 -ax asm5 --eqx ReferenceChr01.fa TranslocationChr01.fa
[M::main] Real time: 131.814 sec; CPU: 131.594 sec; Peak RSS: 3.224 GB
/ds3512/home/panyp/ruanjian/python35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/ds3512/home/panyp/ruanjian/python35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/ds3512/home/panyp/ruanjian/python35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
invOut.txt is empty. Skipping analysing it.
invTLOut.txt is empty. Skipping analysing it.
ctxOut.txt is empty. Skipping analysing it.
inv Out.txt is empty. Skipping analysing it.
invTL Out.txt is empty. Skipping analysing it.
dup Out.txt is empty. Skipping analysing it.
ctxOut.txt is empty. Skipping analysing it.
invOut.txt is empty. Skipping analysing it.
invTLOut.txt is empty. Skipping analysing it.
ctxOut.txt is empty. Skipping analysing it.
Finished syri

Thank you a lot.

@mnshgl0110
Copy link
Member

Hi Shiyi. The newer versions of minimap2 are aligning translocations as long indels, which obviously results in SyRI not able to identify these translocations. You can try optimising minimap2 parameters to align small translocations correctly. Probably -r parameter of minimap2 could do that. Alternatively, you can consider using nucmer. The nucmer3 is more memory efficient so should not give you problems, but is comparatively much slower.
I tested the below pipeline

nucmer3 --maxmatch -c 100 -b 500 -l 50 ReferenceChr01.fa TranslocationChr01.fa
delta-filter -m -i 90 -l 100 out.delta > out.filtered.delta
syri -c out.filtered.coords -d out.filtered.delta -r ReferenceChr01.fa -q TranslocationChr01.fa -k -F T --prefix Reference2Translocation_nucmer

and the output is much better.
Reference2Translocation_nucmer_syri.vcf.txt

I will have to test how to use minimap2 for SR calling but that would take some time.

@shiyi-pan
Copy link
Author

Thank you for your help. I will try nucmer3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants