Fusion Calling Error #5
Comments
Hi Evan,

Thank you for trying out our tool. I can't easily see why you are getting the error for the It seems you have all the relevant files including the barcoded BAM file

For each barcode, there will ideally be only one transcript and multiple PCR copies of that transcript (some of which might align to isoforms of the gene if you include them in the reference FASTA, such as RUNX1-205 and RUNX1-216 in our example).

The scripts for barcode matching are the ones you identified. However, to extract the candidates from the reads

Let me know if this explanation, along with the README information, answers your questions. I'm happy to discuss this further if you would like to have a chat. I can potentially give your data a try if you would like to share it.

Best,
Hello,

Concerning the

Thank you again for your help, this is a great tool!

Evan
I'm glad it worked for you! Very interesting that you see so many cells with fusions! I am curious whether you see this only or mostly in the expected cells, or whether many other cells, which you don't expect to bear fusions, seem to have fusion transcripts too.

The barcodes/reads reported as having the fusion but absent from your Seurat object may be technical artifacts or unusual combinations of errors that end up being reported as matching a barcode in the whitelist. These reads should be a small fraction of all reads compared to the reads that belong to real cells. For example, 2000 barcodes each with one read vs. 1000 barcodes each with an average of 30 reads gives 2000 out of 32000 reads not accounted for. It's important to ensure that you filter potentially artifactual reads due to errors in sequencing/PCR. Also, depending on the conditions, these may be real molecules in empty droplets.

Regarding obtaining more reads matched to the barcodes, you can certainly include only the list of barcodes you have identified as cells. To do this, you can provide the This approach will increase the specificity on those cells and speed up the barcode alignment process due to a smaller reference size. However, there's a chance of losing some precision by forcing 'some' reads to match your whitelist erroneously. As mentioned above, these reads may be technical artifacts or molecules in empty droplets. A middle ground could be selecting all the barcodes associated with droplets, which usually range from 30k to 100k (including all barcodes with 5-10 UMIs and higher), compared to the entire whitelist of 737k.

I have an idea of why the

Let me know if you have any other questions.

Mehdi
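To make the arithmetic above concrete, here is a minimal Python sketch. It is not part of nanoranger: the function names, the dictionary inputs, and the `min_umis=5` default are hypothetical and only illustrate the two ideas in this comment, namely measuring what fraction of reads falls outside the called cells and building a "middle ground" whitelist from barcodes with enough UMIs.

```python
def unaccounted_read_fraction(reads_per_barcode, cell_barcodes):
    """Fraction of reads whose barcode is not among the called cells."""
    total = sum(reads_per_barcode.values())
    if total == 0:
        return 0.0
    outside = sum(
        n for bc, n in reads_per_barcode.items() if bc not in cell_barcodes
    )
    return outside / total


def droplet_barcodes(umis_per_barcode, min_umis=5):
    """Barcodes with at least `min_umis` UMIs (hypothetical threshold)."""
    return sorted(bc for bc, n in umis_per_barcode.items() if n >= min_umis)


# Worked example from the comment: 1000 cells with 30 reads each,
# plus 2000 stray barcodes with a single read each.
cells = {f"CELL{i:04d}" for i in range(1000)}
reads_per_barcode = {bc: 30 for bc in cells}
reads_per_barcode.update({f"STRAY{i:04d}": 1 for i in range(2000)})

fraction = unaccounted_read_fraction(reads_per_barcode, cells)
print(f"{fraction:.4f}")  # 2000 / 32000 = 0.0625
```

The per-barcode counts would have to come from your own data (for example from the Seurat object or the raw count matrix); how you load them is left open here.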
Hello,

Thank you again for all the explanations, that's very useful to know! However, I have noticed that 8~9% of cells in other clusters, not labeled as cancer cells, are also harboring the fusion found by nanoranger. I am currently investigating this and trying to determine the reason for this mismatch. It could be due to many reasons, as you suggested, or to the use of raw barcodes instead of filtered barcodes in my case. There could also be a few mislabeled cells, but I would not expect them at this level. I will investigate those specific cells.

I will also use the exact FASTA sequence of the fusion present in each sample, as found by FusionSeeker, to determine whether a sample-specific fusion sequence can (1) improve detection and (2) lower the false positive rate. Finally, I will try to filter out barcodes below a defined number of reads harboring the fusion to remove false positives, as you mentioned.

Do you think it's possible to have a quality or confidence score for the barcode matching when using the whitelist of barcodes as input? I saw there is a file named

Overall, 90~91% of cells harboring the fusion going to the right cluster is very good; to me it means that our signature for finding cancer cells is correct. If we can raise this score, that would be great. I also have matched normal samples that I will use as a negative control to see whether the tool finds any cells harboring the fusion; this will help me define a threshold % of error.

Thank you,
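The read-count filter and the cluster check described above could be sketched as follows. This is an illustrative Python example under assumptions, not nanoranger output handling: the barcodes, read counts, cluster labels, and the `min_reads=2` cutoff are made up, and the function names are hypothetical.

```python
def confident_fusion_barcodes(fusion_reads_per_barcode, min_reads=2):
    """Keep barcodes supported by at least `min_reads` fusion reads."""
    return {
        bc for bc, n in fusion_reads_per_barcode.items() if n >= min_reads
    }


def fraction_in_cluster(barcodes, cluster_of, expected="cancer"):
    """Among annotated barcodes, the fraction labeled `expected`.

    Barcodes missing from `cluster_of` (e.g. not in the Seurat object)
    are ignored. Returns None if no barcode is annotated.
    """
    annotated = [bc for bc in barcodes if bc in cluster_of]
    if not annotated:
        return None
    hits = sum(1 for bc in annotated if cluster_of[bc] == expected)
    return hits / len(annotated)


# Made-up example data: fusion read counts and cluster labels per barcode.
fusion_reads = {"AAAC": 12, "AAAG": 7, "AAAT": 1, "AACC": 3, "AACG": 1}
clusters = {"AAAC": "cancer", "AAAG": "cancer", "AAAT": "T cell", "AACC": "cancer"}

kept = confident_fusion_barcodes(fusion_reads, min_reads=2)
before = fraction_in_cluster(fusion_reads, clusters)
after = fraction_in_cluster(kept, clusters)
print(sorted(kept), before, after)
```

Running the same two functions on a matched normal sample would give an empirical estimate of how many barcodes pass the filter when no fusion is expected, which is one way to pick the cutoff.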
Hi Evan,

There's the alignment score that each candidate read gets. With 10x barcodes the highest is a perfect match (score 16 = 16 nt matching). What you see in

Some questions to help you think about why you observe the fusion in non-malignant cells: How long is the fusion transcript, assuming its entirety is captured in the 10x cDNA (highly unlikely for transcripts longer than 2-3 kb)? Depending on the type of artifact and expression, you might be getting one UMI per non-malignant cell and many UMIs for malignant cells. Or you might be getting one real UMI with lots of supporting reads in the malignant cells and one fake UMI with 1 or 2 reads in the non-malignant cells.

Let me know if you have other questions.

Mehdi
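The distinction between "many UMIs" and "one UMI with few reads" can be checked with a small tally. The sketch below is a hypothetical Python example, not a nanoranger function: it assumes you can produce one `(barcode, umi)` pair per fusion-supporting read, and the `min_reads_per_umi=3` cutoff is an arbitrary illustration.

```python
from collections import Counter, defaultdict


def umi_support(read_tags):
    """Count fusion reads per UMI within each barcode.

    `read_tags` is an iterable of (barcode, umi) pairs, one per read.
    """
    support = defaultdict(Counter)
    for barcode, umi in read_tags:
        support[barcode][umi] += 1
    return support


def summarize(support, min_reads_per_umi=3):
    """Per barcode: UMI count, read count, and UMIs passing the cutoff."""
    summary = {}
    for barcode, umis in support.items():
        summary[barcode] = {
            "umis": len(umis),
            "reads": sum(umis.values()),
            "supported_umis": sum(
                1 for n in umis.values() if n >= min_reads_per_umi
            ),
        }
    return summary


# Made-up reads: CELL_A has two well-supported UMIs, CELL_B a single read.
reads = [("CELL_A", "U1")] * 5 + [("CELL_A", "U2")] * 4 + [("CELL_B", "U9")]
summary = summarize(umi_support(reads))
print(summary)
```

A barcode like `CELL_B`, with one UMI backed by a single read, matches the "fake UMI" pattern described above and would be a candidate for filtering.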
Hello,
Thank you for this tool. I have a 5' 10x library sequenced with Nanopore sequencing. I previously used JAFFAL to recover known fusions from single-cell data, which works quite well, and I wanted to try your fusion detection pipeline with a FASTA file to see how it performs. However, I encounter this error message on my own data:
Surprisingly, I encounter the same error with the test data
Here is my working environment
The files present in the output directory for my data so far are :
fusion_barcode_scores.csv
fusion_barcode_scores.pdf
fusion_bcumi_dedup.csv
fusion_BCUMI.fasta.gz
fusion_deconcat.fastq.gz
fusion_genome_tagged.bam
fusion_genome_tagged.bam.bai
fusion_knee.pdf
fusion_matching.sam
fusion_trns_ct.csv
I was hoping for an output file listing the reads, their barcodes, and the presence of the fusion, but I'm not sure I've found this in any of these files. Do you have a wiki describing the output files and their contents? I guess I must use fusion_gene.py in the downstream folder under scripts, but I am unsure which arguments I need to provide.
Also, regarding the scripts you provide, which one performs the extraction of the 10x barcodes? I saw that there are two bash scripts, barcode_align.sh and barcode_ref.sh, so I imagine those are the two that are called, right?
Thank you for your help,
Evan