demultiplexing fast5 #46

Open
aspitaleri opened this issue Mar 9, 2021 · 19 comments

@aspitaleri

Hi
From #14, it seems that MotifSeq could demultiplex a bundle of fast5 files to retrieve the fast5 files for each strain. Is this right? Or are there other options in SquiggleKit to do the same?
Thanks

@Psy-Fer Psy-Fer self-assigned this Mar 11, 2021
@Psy-Fer Psy-Fer added the question Further information is requested label Mar 11, 2021
@Psy-Fer
Owner

Psy-Fer commented Mar 11, 2021

It depends on how you are demultiplexing, and on what you mean by strains.

More information is required for me to answer your question.

@aspitaleri
Author

Hi
Basically, I have a bundle of fast5 files from a MinION run that includes sequencing from different bacterial strains (i.e. samples). Normally, I basecall and then demultiplex the fastq using guppy. Now I'd like to demultiplex the fast5 files directly, dividing them per barcode without going through basecalling.
Hope it is clear.

@Psy-Fer
Owner

Psy-Fer commented Mar 11, 2021

Oooh, right. fast5_fetcher_multi paired with ont_fast5_api would be the tools for that.

For each barcodeXX.fastq file, do something like this:

mkdir dmux_barcode01_single

# extract the individual fast5 files
python3 fast5_fetcher_multi.py -q barcode01.fastq -s sequencing_summary.txt -m /path/to/fast5s/ -o ./dmux_barcode01_single/

# package the individual files up again (I really should just do this in fast5_fetcher...one day)
single_to_multi_fast5 --input_path dmux_barcode01_single/ --save_path dmux_barcode01_multi --filename_base barcode01

# remove intermediate fast5 files
rm -r dmux_barcode01_single/

This uses the read IDs in the demultiplexed fastq file, matches them with the fast5 filenames in the sequencing summary, finds those files in the path given with -m, and saves the reads to the directory given with -o. The output directory should be created before running fast5_fetcher_multi.

ont_fast5_api then has a script called single_to_multi_fast5, which packs the extracted fast5s into multi-read files again.
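To repeat the steps above for every barcode, something like the following could generate the commands. This is only a sketch: the helper name `barcode_commands` and the `barcode*.fastq` glob are illustrative assumptions, the fast5_fetcher_multi flags are the ones shown above, and the `single_to_multi_fast5` call uses the long option names from ont_fast5_api (`--input_path`, `--save_path`, `--filename_base`). It prints the commands as a dry run instead of executing them.

```python
import shlex
from pathlib import Path


def barcode_commands(fastq_path, summary_path, fast5_dir,
                     fetcher="fast5_fetcher_multi.py"):
    """Build the four per-barcode commands as argument lists (nothing is run)."""
    # "barcode01.fastq" -> "barcode01"; assumes the barcode is the file name stem.
    barcode = Path(fastq_path).name.split(".")[0]
    single_dir = f"dmux_{barcode}_single"
    multi_dir = f"dmux_{barcode}_multi"
    return [
        # The output directory must exist before the fetcher runs.
        ["mkdir", "-p", single_dir],
        # Extract the individual fast5 files for this barcode.
        ["python3", fetcher, "-q", str(fastq_path), "-s", str(summary_path),
         "-m", str(fast5_dir), "-o", single_dir],
        # Pack the single-read files into multi-read files again.
        ["single_to_multi_fast5", "--input_path", single_dir,
         "--save_path", multi_dir, "--filename_base", barcode],
        # Remove the intermediate single-read files.
        ["rm", "-r", single_dir],
    ]


if __name__ == "__main__":
    # Dry run: print the commands for every barcode fastq in the current directory.
    for fastq in sorted(Path(".").glob("barcode*.fastq")):
        for command in barcode_commands(fastq, "sequencing_summary.txt",
                                        "/path/to/fast5s/"):
            print(shlex.join(command))
```

Printing first makes it easy to check the paths before anything is deleted; the same lists could be passed to `subprocess.run(command, check=True)` once they look right.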

Note that if you are on a system with hard limits on the number of files, such as an HPC, check how many reads are in each barcodeXX.fastq file, as each read produces one fast5 file, so you could hit those limits. If that is an issue, you can split the fastq file up and run it in parts, or extract the read IDs manually and use only ont_fast5_api to extract the reads.
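The read count check and the manual read ID route could look roughly like this. It is a minimal sketch, assuming standard four-line FASTQ records (optionally gzipped); the function names and the output file name are illustrative, not part of SquiggleKit. The resulting list, one read ID per line, is the kind of file that ont_fast5_api's `fast5_subset` accepts through `--read_id_list`.

```python
import gzip


def fastq_read_ids(fastq_path):
    """Yield the read ID of each record in a four-line-per-record FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Header lines are every fourth line, starting with the first.
            if line_number % 4 != 0:
                continue
            if not line.strip():
                continue  # tolerate a trailing blank line
            if not line.startswith("@"):
                raise ValueError(f"unexpected FASTQ header: {line!r}")
            # The read ID is the first whitespace-delimited token after '@'.
            yield line[1:].split()[0]


def write_read_id_list(fastq_path, out_path):
    """Write one read ID per line and return how many were written."""
    count = 0
    with open(out_path, "w") as out:
        for read_id in fastq_read_ids(fastq_path):
            out.write(read_id + "\n")
            count += 1
    return count
```

The returned count is also the number of single-read fast5 files that would be created for that barcode, so it can be compared against the file limit before extracting anything.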

I hope this helps.

@aspitaleri
Author

Right. So there is no way to avoid going through basecalling/demultiplexing first, i.e. to do this without fastq files. Actually, the MinION produces a sequencing_summary.txt during the run even when generating fast5 only. Could I use that file to assign each read to a barcode?

@Psy-Fer
Owner

Psy-Fer commented Mar 11, 2021

Ahh, well, the only DNA signal-level barcode demultiplexer out there that I know of is Deepbinner. But it's deprecated now, if I remember correctly.

MotifSeq isn't sensitive enough to do it as well as base-level demultiplexing.

So no, there isn't really an easy way to avoid basecalling.

@aspitaleri
Author

So the approach described in the Nanopore adapter identification section here, https://psy-fer.github.io/SquiggleKitDocs/MotifSeq/#background, is not useful for this?

@Psy-Fer
Owner

Psy-Fer commented Mar 12, 2021

It would work, yes, but not as effectively as a demultiplexer working at the base level. Only a system using some form of machine learning, as in Deepbinner or in what we have done with deeplexicon, would get similar or better results.

Is there a particular reason to do this? Perhaps there is another solution.

@aspitaleri
Author

Well, my purpose is to bypass basecalling in order to remove one source of error, and then use the UNCALLED pipeline (https://github.com/skovaka/UNCALLED) to map the fast5 reads to a reference genome, i.e. amplicon analysis. That's why I need to demultiplex a MinION run into the different barcodes before mapping.

@Psy-Fer
Owner

Psy-Fer commented Mar 12, 2021

UNCALLED uses the Read Until API. Are you planning to do the demultiplexing in real time, or are you looking to run UNCALLED after a run?

The accuracy of UNCALLED is not as good as basecalling and aligning, as the base sequence it uses is only an approximation.

@aspitaleri
Author

The idea is to run it after the run, on amplicons, so at very high depth (>4000), and then compare with the standard procedure to check whether the approach is feasible. Thanks for your comments.

@Psy-Fer
Owner

Psy-Fer commented Mar 12, 2021

If you want to benchmark to see how well it does, you can use the regular demultiplexing data to split the UNCALLED output and assess it that way. Then, if it is better, look into demultiplexing with signal.
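Splitting the mapping output by the regular demultiplexing result could be sketched as follows. The assumptions here are that the mapping output is PAF-style text with the read ID in the first tab-separated column, and that the basecaller's summary file is tab-separated with `read_id` and `barcode_arrangement` columns; check both against your own files. The function names are illustrative.

```python
import csv
from collections import defaultdict


def load_read_barcodes(summary_path, id_col="read_id",
                       barcode_col="barcode_arrangement"):
    """Map read ID -> barcode from a tab-separated summary file.

    The column names are assumptions; pass different ones if your file differs.
    """
    with open(summary_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_col]: row[barcode_col] for row in reader}


def split_paf_by_barcode(paf_lines, read_to_barcode, unknown="unassigned"):
    """Group PAF-style lines by the barcode of the read in their first column."""
    by_barcode = defaultdict(list)
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        read_id = line.split("\t", 1)[0]
        # Reads missing from the summary are collected under `unknown`.
        by_barcode[read_to_barcode.get(read_id, unknown)].append(line)
    return dict(by_barcode)
```

Each group can then be written to its own file and compared with the alignments obtained from the basecalled reads of the same barcode.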

There is a possibility for me to extend deeplexicon algorithms to DNA, rather than just RNA.

@aspitaleri
Author

Let me see if I understood correctly. Basecall/demultiplex the fast5 using e.g. guppy. Then, as you suggested in #46 (comment), get the fast5 per barcode using the sequencing_summary, and then use the UNCALLED pipeline to get the fasta. Finally, compare the results. Right?

@Psy-Fer
Owner

Psy-Fer commented Mar 12, 2021

Sounds about right yea. Plus the fiddly bits in between. Good luck!

@aspitaleri
Author

Yep! I will let you know how it goes.
If it is better, we then need to think about how to avoid the basecalling step ... but that is another story.
Thanks a lot for your help and comments.

@Psy-Fer
Owner

Psy-Fer commented Mar 12, 2021

You are welcome.

If that is the case, I'll build a demultiplexer.

@aspitaleri
Author

Uaooo - that sounds great really. Keep in touch then!

@aspitaleri
Author

I see that you indeed have something similar, but for RNA:
https://github.com/Psy-Fer/deeplexicon. Good to know.

@Psy-Fer
Owner

Psy-Fer commented Mar 30, 2021

Yes.

I'm going to extend that to DNA. Planning to have something in a few months.

@aspitaleri
Author

That's great! I will wait for your tool. If you need someone to test it before release, I will be happy to do it.
