Potential bug when using paired-end files #17

mapo9 · 2024-01-05T16:02:39Z

Hi,
I found some weird behaviour when running paired-end data.

To test a few things, I created simulated datasets for which I know all the parameters: repertoire size, size of each clonotype, V gene, J gene, etc.
I wrote the FASTQ files as paired-end reads and ran catt with the following command:
catt --f1 test_R1.fastq --f2 test_R2.fastq -o test_out -t 20

The unintended behaviour is easy to see in one of my samples, a repertoire consisting of a single clonotype with 10,000 clones.
Catt returns 3 clones with exactly the same NNseq:
AAseq,NNseq,Prob,Vregion,Jregion,Dregion,Frequency
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*02,TRBJ2-1*01,TRBD1*01,6727
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*06,TRBJ2-1*01,TRBD1*01,6727
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*03,TRBJ2-1*01,TRBD1*01,6546

When I combine the "different" clonotypes into one, the frequencies sum to 20,000 clones instead of 10,000.
So it seems that catt is counting each clone twice.
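
For anyone who wants to reproduce the sum, here is a minimal Python sketch of how the rows can be collapsed. It only post-processes the CSV output quoted above; the function name collapse_by_nnseq is my own and not part of catt.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the paired-end catt output quoted in this issue.
PAIRED_END_CSV = """\
AAseq,NNseq,Prob,Vregion,Jregion,Dregion,Frequency
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*02,TRBJ2-1*01,TRBD1*01,6727
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*06,TRBJ2-1*01,TRBD1*01,6727
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*03,TRBJ2-1*01,TRBD1*01,6546
"""


def collapse_by_nnseq(csv_text):
    """Sum the Frequency column over all rows that share the same NNseq."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["NNseq"]] += int(row["Frequency"])
    return dict(totals)


totals = collapse_by_nnseq(PAIRED_END_CSV)
for nnseq, count in totals.items():
    print(nnseq, count)
```

With the three rows above this gives a single NNseq with a total of 20000, twice the simulated clone count.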

I then merged the paired-end files into a single file using pear and repeated the analysis.
This returned the same clonotypes as the paired-end run; only the frequencies differ.
AAseq,NNseq,Prob,Vregion,Jregion,Dregion,Frequency
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*06,TRBJ2-1*01,TRBD1*01,3379
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*02,TRBJ2-1*01,TRBD1*01,3348
CSARGDRGLSYNEQFF,TGCAGTGCTCGGGGGGACAGGGGGCTATCCTACAATGAGCAGTTCTTC,0.00016,TRBV20-1*03,TRBJ2-1*01,TRBD1*01,3273
For the merged "single-end" file the clones sum to the expected 10,000.
What confuses me a little, though, is that the counts aren't exactly half of those in the paired-end run.
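
To make the "not exactly half" point concrete, here is a small Python sketch that compares the two runs allele by allele. The numbers are copied from the two outputs above; the dictionary and function names are just for illustration.

```python
# Frequencies per V allele, copied from the two catt outputs in this issue.
paired = {"TRBV20-1*02": 6727, "TRBV20-1*06": 6727, "TRBV20-1*03": 6546}
merged = {"TRBV20-1*06": 3379, "TRBV20-1*02": 3348, "TRBV20-1*03": 3273}


def paired_to_merged_ratio(paired_counts, merged_counts):
    """Return paired-end count divided by merged count for every allele."""
    return {
        allele: paired_counts[allele] / merged_counts[allele]
        for allele in paired_counts
    }


ratios = paired_to_merged_ratio(paired, merged)
for allele in sorted(ratios):
    print(f"{allele}: {ratios[allele]:.3f}")

# Totals over all three rows: 20000 for paired-end, 10000 for merged.
print(sum(paired.values()), sum(merged.values()))
```

Only TRBV20-1*03 comes out at a ratio of exactly 2; the other two alleles are slightly above and below it. I also noticed that 3379 + 3348 = 6727, i.e. the paired-end count reported for both *02 and *06 equals the sum of their two merged counts, but I can't say whether that is meaningful or a coincidence.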

I guess that there must be some issue when counting the frequency for the paired-end samples.

Would be awesome if you could have a look!
Thanks!


songofbin · 2024-11-05T10:25:59Z

If I understand correctly, CATT defines a TCR clone by its V and J genes in addition to cdr3aa. The problem is that the assignment of V and J genes can sometimes be inaccurate, which splits a single TCR into several clones.
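
As a rough illustration of the point above, the sketch below collapses the merged-run rows to gene level by dropping the allele suffix (the part after `*`, as in TRBV20-1*06). This is only a post-processing idea applied to the output table, not how CATT itself defines clones, and the helper names are hypothetical.

```python
from collections import Counter

# (AAseq, Vregion, Jregion, Frequency) from the merged single-end run above.
rows = [
    ("CSARGDRGLSYNEQFF", "TRBV20-1*06", "TRBJ2-1*01", 3379),
    ("CSARGDRGLSYNEQFF", "TRBV20-1*02", "TRBJ2-1*01", 3348),
    ("CSARGDRGLSYNEQFF", "TRBV20-1*03", "TRBJ2-1*01", 3273),
]


def gene_only(name):
    """Drop the allele suffix, e.g. 'TRBV20-1*06' -> 'TRBV20-1'."""
    return name.split("*", 1)[0]


def collapse_to_gene_level(records):
    """Sum frequencies of rows that agree on AAseq, V gene and J gene."""
    totals = Counter()
    for aaseq, v_region, j_region, frequency in records:
        totals[(aaseq, gene_only(v_region), gene_only(j_region))] += frequency
    return dict(totals)


gene_level = collapse_to_gene_level(rows)
print(gene_level)
```

At gene level the three rows become one clone with 10000 counts, which matches the simulated repertoire. The same step would not explain the doubled total in the paired-end run.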
