Memory leak in GenotypeGVCFs with -all-sites #8989

Open
brisk022 opened this issue Oct 1, 2024 · 8 comments
@brisk022

brisk022 commented Oct 1, 2024

Bug Report

Affected tool(s) or class(es)

GenotypeGVCFs with -all-sites

Affected version(s)

  • 4.2 through 4.6

Description

We tried to run GenotypeGVCFs from GATK 4.5 with -all-sites on a dataset of 120 samples with GRCh37 as the reference. Each run was limited to a single chromosome, and all of them failed after consuming 3 TB of memory. I then tried a smaller subset of 8 samples with memory limited to 32 GB, and all the runs failed after 3-10 Mbp, depending on the chromosome.

Finally, I randomly picked chromosome 9 and ran GATK versions 4.1 through 4.6; only 4.1 did not show the problem. It finished the whole chromosome (141 Mbp) with a maximum memory usage of around 8 GB. All the others failed after 3-6 Mbp. (Sorry, I used different memory settings for 4.5, so I did not include it.)

[Plot: memory_usage]

Time is in seconds, memory is in MB.

If I run the same command without -all-sites, the maximum memory usage is around 1.6 GB.

Steps to reproduce

The GenomicsDB workspace was created with the corresponding GATK version as follows:

gatk --java-options "-Xmx12000m" GenomicsDBImport --genomicsdb-workspace-path tmp/genomicsdb44/9 \
    --genomicsdb-shared-posixfs-optimizations --batch-size 120 --verbosity DEBUG \
    -L 9 -V data/gatk/gvcf/9/1.g.vcf.gz -V data/gatk/gvcf/9/2.vcf.gz -V data/gatk/gvcf/9/3.g.vcf.gz \
    -V data/gatk/gvcf/9/4.g.vcf.gz -V data/gatk/gvcf/9/5.g.vcf.gz -V data/gatk/gvcf/9/6.g.vcf.gz \
    -V data/gatk/gvcf/9/7.g.vcf.gz -V data/gatk/gvcf/9/8.g.vcf.gz

GenotypeGVCFs was run as:

gatk --java-options "-Xmx12g" GenotypeGVCFs -R data/ref/hs37d5.fa.gz \
    -V gendb://tmp/genomicsdb44/9 -O data/gatk/variants/9/raw44.vcf.gz -L 9 \
    --tmp-dir ./tmp/tmp -all-sites

All runs were performed under resource_monitor, which was instructed to kill the process if it consumed more than 14000 MB of memory. Thus, at least 2 GB was left beyond the Java heap for reading GenomicsDB. The size of the GenomicsDB workspace on disk is around 3.1 GB for versions >=4.2 and 3.0 GB for version 4.1.
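
The monitoring setup described above can be approximated with a small shell sketch. This is not the resource_monitor tool used in the report: `rss_kb`, `rss_mb` and `watch_rss` are hypothetical helper names, the sketch watches a single PID rather than a whole process tree, and it assumes Linux `/proc` (with a `ps` fallback elsewhere). It prints one "elapsed seconds, resident memory in MB" pair per sample and kills the process once a limit is exceeded.

```shell
#!/bin/sh
# Minimal stand-in for a memory monitor (NOT resource_monitor itself).
# All function names here are hypothetical helpers for illustration.

# rss_kb PID -> resident set size in KB, empty if the process is gone
rss_kb() {
    if [ -r "/proc/$1/status" ]; then
        # Linux: VmRSS is reported in kB; zombies have no VmRSS line
        awk '/^VmRSS:/ { print $2 }' "/proc/$1/status" 2>/dev/null
    else
        # Fallback: ps reports RSS in KB
        ps -o rss= -p "$1" 2>/dev/null | tr -d ' '
    fi
}

# rss_mb PID -> resident set size in MB (KB / 1024), fails if PID is gone
rss_mb() {
    kb="$(rss_kb "$1")"
    [ -n "$kb" ] || return 1
    echo $(( $kb / 1024 ))
}

# watch_rss PID LIMIT_MB [INTERVAL_S]
# Prints "elapsed_s rss_mb" per sample; kills PID and returns 1 above the limit.
watch_rss() {
    pid="$1"; limit="$2"; step="${3:-1}"; t=0
    while mb="$(rss_mb "$pid")"; do
        echo "$t $mb"
        if [ "$mb" -gt "$limit" ]; then
            kill "$pid"
            return 1
        fi
        sleep "$step"
        t=$(( $t + $step ))
    done
    return 0
}
```

A call such as `watch_rss "$JAVA_PID" 14000 10` would mirror the 14000 MB limit from the report. As far as I know, the `gatk` launcher is a wrapper script that starts a separate `java` process, so the PID to watch is that of the `java` process, which has to be looked up separately (for example with `pgrep`).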

@gokalpcelik
Contributor

Can you reduce the maximum number of alleles per site when you run this analysis?

@brisk022
Author

brisk022 commented Oct 2, 2024

Sure, below are the results when running with --max-alternate-alleles 5

[Plot: memory_usage_ma5]

@lbergelson lbergelson added the bug label Oct 16, 2024
@gokalpcelik
Contributor

Hi @brisk022

There is an update on this issue. We were able to reproduce the problem ourselves, and it looks like there is a memory management issue somewhere in the GenomicsDB-related code inside GenotypeGVCFs.

Our temporary solution, until we make an updated release, is to convert the imported GenomicsDB instances to GVCF using

gatk SelectVariants -V gendb://instancename -O GVCF_export.g.vcf.gz -R ref.fa -L whateverintervalusedinGDBimport

and then use this GVCF file as input for the GenotypeGVCFs tool. This keeps memory usage at reasonable levels and avoids any apparent leaks.

I hope this helps.

Regards.
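
The workaround described above can be sketched as a two-step script. The paths and the -Xmx12g setting are taken from the original report. The intermediate file name `tmp/combined.9.g.vcf.gz`, the helper functions and the `DRY_RUN` switch are assumptions made for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the workaround: export the GenomicsDB workspace to a combined GVCF,
# then genotype from that file instead of from gendb://.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.

REF="data/ref/hs37d5.fa.gz"              # reference from the original report
WORKSPACE="tmp/genomicsdb44/9"           # GenomicsDB workspace from the report
INTERVAL="9"                             # same -L as used for GenomicsDBImport
GVCF="tmp/combined.9.g.vcf.gz"           # hypothetical intermediate file name
OUT="data/gatk/variants/9/raw44.vcf.gz"  # output path from the report

# run CMD...: print the command in dry-run mode, otherwise execute it
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: extract the combined GVCF from the GenomicsDB workspace
export_gvcf() {
    run gatk SelectVariants -R "$REF" -V "gendb://$WORKSPACE" -L "$INTERVAL" -O "$GVCF"
}

# Step 2: run GenotypeGVCFs on the exported GVCF rather than on gendb://
genotype_from_gvcf() {
    run gatk --java-options "-Xmx12g" GenotypeGVCFs -R "$REF" -V "$GVCF" \
        -L "$INTERVAL" -O "$OUT" -all-sites
}

export_gvcf && genotype_from_gvcf
```

Keeping the two steps as separate functions makes it possible to rerun only the genotyping step once the GVCF export exists.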

@lbergelson
Member

@nalinigans: We now believe this is actually a GenomicsDB issue (or possibly an issue in the JNI layer.)

@gokalpcelik was able to reproduce this problem on a set of 330 whole exomes. He found that when he ran GenotypeGVCFs from a GenomicsDB, memory usage climbed to tens of GB while the Java heap remained constant. He then tested first extracting the combined GVCF from GenomicsDB and then running GenotypeGVCFs, and saw that memory usage for GenotypeGVCFs remained constant at around 1 GB. So we think this is probably a GenomicsDB issue.

GenomicsDBImport > GenotypeGVCFs ---- Memory ramps up immediately to 10s of gigabytes
GenomicsDBImport > SelectVariants to GVCF > GenotypeGVCFs ---- Memory is fixed at 1.1 GB

He can fill in more detail about the exact configuration if it helps.

@lbergelson
Member

Between 4.1.0.0 and 4.2.0.0 we moved from GenomicsDB 1.0.0-rc2 -> 1.3.2.

@nalinigans
Collaborator

nalinigans commented Oct 23, 2024

@lbergelson @gokalpcelik any chance of giving me access to the workspace for the 330 whole exomes?

@brisk022
Author

Thanks, @gokalpcelik! I tested the workaround, and indeed, when GenotypeGVCFs is run on a GVCF file rather than on GenomicsDB, memory consumption remains reasonable. I only tried GATK 4.6, but it is probably the same with the other affected versions.

@gokalpcelik
Contributor

> @lbergelson @gokalpcelik any chance of giving me access to the workspace for the 330 whole exomes?

Hi @nalinigans
Unfortunately, this is on my private company server, but I may be able to run tests if you need me to. I can create a fork of gatk and update GenomicsDB to test it.
