memory peak #9

dcopetti · 2018-12-06T10:18:26Z

Hello,
From the paper, I see that Hercules needs lots of memory when correcting WGS data. Which is the step that is more memory-intensive? Is it maybe related to the sort step - that would be easier to adapt to smaller computational resources.
Thanks

canfirtina · 2019-01-01T23:50:23Z

Hi @dcopetti,

Usually, the most memory intensive steps are (1) the implementation of the Forward-Backward algorithm (i.e., the correction step) and (2) even loading FAI file to a memory (e.g., "too" large short read files). For the first case, allocation of multi-dimensional arrays creates an overhead during the Forward-Backward calculation (per read). Since the correction of each long read is handled in parallel, memory overhead would increase as the number of threads allocated for Hercules increases.

I believe there are many possibilities for the optimization in how Forward-Backward values are stored per read, and calculation of the Viterbi algorithm, which hopefully will be released in the newer versions of Hercules.

Thanks.

dcopetti · 2019-01-02T10:19:33Z

Hi,

Thanks for the details.
If the error correction step is parallelized, would it be possible to split the long.fasta file in a few chunks and run the -2 step multiple times (not in parallel, to avoid requesting again too much memory)?
Thanks,

canfirtina · 2019-01-03T10:26:17Z

Hi again @dcopetti ,

Yes that could be one option as long as you also divide an alignment file so that each sub-alignment file includes the alignment of short reads only to the long reads that will be corrected within the same run.

Thanks.

dcopetti · 2019-01-03T13:05:28Z

Hi,
What is the reason for which there can not be "extra" lines in the bam file? I mean, if a line has a long read ID that is not present in the reference (the subset of long reads), will that cause a problem to hercules -2?

canfirtina · 2019-01-03T14:29:40Z

It will most probably cause an error. Before answering your question let me describe how it works a little bit further. This is related to how Hercules generates the preprocessing files and uses the alignment file later on the correction step (-2). The alignment file that is generated using preprocessing files includes some integer values (ids) on the REF field instead of the name of long reads. These integer values correspond to the order of long reads as they appear in the original set of long reads. For example, if REF field includes "0" this means a short read aligns to the 1st long read (0-based numbering) in the original set of long reads. Hercules extracts long read sequences based on these values in the REF field instead of using the name of a read (i.e., read id) due to some optimisation reasons. Thus, if you change the order of long reads after generating the preprocessing files, you should be aware that Hercules may be fetching wrong read sequences.

In the case of your question, if an alignment file includes a read id (must be an integer value) that actually is not present in your set of long reads, it will not cause any problem iff this read id is greater than or equal to the number of reads included in the long read set. Otherwise, it will definitely point to a read and it will try to correct that read using the short reads aligning to that read id. Thus, you should always keep the order of the reads in mind and if you change the order of a read, you should also change REF field accordingly so that the values in the REF field still point to correct long reads.

The code snippet that makes the assumptions that I just described is as follows:

Note that Hercules reads long reads from the beginning by keeping track of the order of the reads and checks whether the REF field in the next alignment record matches with the long read that is currently being processed.

hercules/src/main.cpp

Lines 992 to 1009 in e0a7b39

    
           {//block for obtaining new index for the next long read 
        
               std::lock_guard<std::mutex> lk(indexMutex); 
        
               index = readCount; 
        
               if(index < size){ 
        
                   //read the current long read's original sequence 
        
                   readSequence(curSeq, longIndex, index); 
        
                   //read the current long read's original id 
        
                   curSeqId = sequenceName(longIndex, index); 
        
                   int seqId = -1; 
        
                   //id of the last "unproccessed" record in the alignment file. Id-names 0-based as well 
        
                   if(!atEnd(alignmentFileIn)) 
        
                   	seqId = std::stoi(toCString(getContigName(record, alignmentFileIn))); 
        
                   //did we hit the current alignment or there is no alignment for the current long read? 
        
                   //since the alignment file is **sorted** we can assume there is no alignment if index+1 < seqId 
        
                   if(seqId < 0 || index < seqId) shouldCorrect = false; 
        
                   else if(index == seqId){ //there is an alignment. so read the aligned short reads.

dcopetti · 2019-01-04T09:02:42Z

I see.
So how about splitting the long reads file before hercules -1? A smaller "reference" from the beginning should have no effect on the alignment and numbering of the reads, just present a lower coverage of long reads.
What do you think?

canfirtina · 2019-01-04T13:42:35Z

I agree, it is better to split the reads before hercules -1. This should not have an effect on the alignment as you suggested.

These were valuable feedbacks for us as well. I will consider making the parameters more intuitive so that one can run Hercules on a partial set of reads instead of using the whole read sets in order to prevent possible overflows.

Thanks.

dcopetti mentioned this issue Dec 27, 2018

better use trimmed short reads? #8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

memory peak #9

memory peak #9

dcopetti commented Dec 6, 2018

canfirtina commented Jan 1, 2019

dcopetti commented Jan 2, 2019

canfirtina commented Jan 3, 2019 •

edited

Loading

dcopetti commented Jan 3, 2019

canfirtina commented Jan 3, 2019

dcopetti commented Jan 4, 2019

canfirtina commented Jan 4, 2019

memory peak #9

memory peak #9

Comments

dcopetti commented Dec 6, 2018

canfirtina commented Jan 1, 2019

dcopetti commented Jan 2, 2019

canfirtina commented Jan 3, 2019 • edited Loading

dcopetti commented Jan 3, 2019

canfirtina commented Jan 3, 2019

dcopetti commented Jan 4, 2019

canfirtina commented Jan 4, 2019

canfirtina commented Jan 3, 2019 •

edited

Loading