Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory peak #9

Open
dcopetti opened this issue Dec 6, 2018 · 7 comments
Open

memory peak #9

dcopetti opened this issue Dec 6, 2018 · 7 comments

Comments

@dcopetti
Copy link

dcopetti commented Dec 6, 2018

Hello,
From the paper, I see that Hercules needs lots of memory when correcting WGS data. Which is the step that is more memory-intensive? Is it maybe related to the sort step - that would be easier to adapt to smaller computational resources.
Thanks

@canfirtina
Copy link
Member

Hi @dcopetti,

Usually, the most memory intensive steps are (1) the implementation of the Forward-Backward algorithm (i.e., the correction step) and (2) even loading FAI file to a memory (e.g., "too" large short read files). For the first case, allocation of multi-dimensional arrays creates an overhead during the Forward-Backward calculation (per read). Since the correction of each long read is handled in parallel, memory overhead would increase as the number of threads allocated for Hercules increases.

I believe there are many possibilities for the optimization in how Forward-Backward values are stored per read, and calculation of the Viterbi algorithm, which hopefully will be released in the newer versions of Hercules.

Thanks.

@dcopetti
Copy link
Author

dcopetti commented Jan 2, 2019

Hi,

Thanks for the details.
If the error correction step is parallelized, would it be possible to split the long.fasta file in a few chunks and run the -2 step multiple times (not in parallel, to avoid requesting again too much memory)?
Thanks,

@canfirtina
Copy link
Member

canfirtina commented Jan 3, 2019

Hi again @dcopetti ,

Yes that could be one option as long as you also divide an alignment file so that each sub-alignment file includes the alignment of short reads only to the long reads that will be corrected within the same run.

Thanks.

@dcopetti
Copy link
Author

dcopetti commented Jan 3, 2019

Hi,
What is the reason for which there can not be "extra" lines in the bam file? I mean, if a line has a long read ID that is not present in the reference (the subset of long reads), will that cause a problem to hercules -2?

@canfirtina
Copy link
Member

It will most probably cause an error. Before answering your question let me describe how it works a little bit further. This is related to how Hercules generates the preprocessing files and uses the alignment file later on the correction step (-2). The alignment file that is generated using preprocessing files includes some integer values (ids) on the REF field instead of the name of long reads. These integer values correspond to the order of long reads as they appear in the original set of long reads. For example, if REF field includes "0" this means a short read aligns to the 1st long read (0-based numbering) in the original set of long reads. Hercules extracts long read sequences based on these values in the REF field instead of using the name of a read (i.e., read id) due to some optimisation reasons. Thus, if you change the order of long reads after generating the preprocessing files, you should be aware that Hercules may be fetching wrong read sequences.

In the case of your question, if an alignment file includes a read id (must be an integer value) that actually is not present in your set of long reads, it will not cause any problem iff this read id is greater than or equal to the number of reads included in the long read set. Otherwise, it will definitely point to a read and it will try to correct that read using the short reads aligning to that read id. Thus, you should always keep the order of the reads in mind and if you change the order of a read, you should also change REF field accordingly so that the values in the REF field still point to correct long reads.

The code snippet that makes the assumptions that I just described is as follows:

Note that Hercules reads long reads from the beginning by keeping track of the order of the reads and checks whether the REF field in the next alignment record matches with the long read that is currently being processed.

hercules/src/main.cpp

Lines 992 to 1009 in e0a7b39

{//block for obtaining new index for the next long read
std::lock_guard<std::mutex> lk(indexMutex);
index = readCount;
if(index < size){
//read the current long read's original sequence
readSequence(curSeq, longIndex, index);
//read the current long read's original id
curSeqId = sequenceName(longIndex, index);
int seqId = -1;
//id of the last "unproccessed" record in the alignment file. Id-names 0-based as well
if(!atEnd(alignmentFileIn))
seqId = std::stoi(toCString(getContigName(record, alignmentFileIn)));
//did we hit the current alignment or there is no alignment for the current long read?
//since the alignment file is **sorted** we can assume there is no alignment if index+1 < seqId
if(seqId < 0 || index < seqId) shouldCorrect = false;
else if(index == seqId){ //there is an alignment. so read the aligned short reads.

@dcopetti
Copy link
Author

dcopetti commented Jan 4, 2019

I see.
So how about splitting the long reads file before hercules -1? A smaller "reference" from the beginning should have no effect on the alignment and numbering of the reads, just present a lower coverage of long reads.
What do you think?

@canfirtina
Copy link
Member

I agree, it is better to split the reads before hercules -1. This should not have an effect on the alignment as you suggested.

These were valuable feedbacks for us as well. I will consider making the parameters more intuitive so that one can run Hercules on a partial set of reads instead of using the whole read sets in order to prevent possible overflows.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants