No evaluation statistics for het and homalt calls during model training #876
Hi @DanJeffries, can you please paste the output of the following command here:

cat /home/examples_shuffled/train/All_samples_training_examples.dataset_config.pbtxt

and also, separately, run the following and paste the output:

cat /home/examples_shuffled/tune/All_samples_tune_examples.dataset_config.pbtxt
Hi @kishwarshafin, sure. Just a quick note first to explain the outputs, since it may be relevant to the problem. Given the number of examples, I couldn't easily perform the shuffling step on my local cluster (using DirectRunner) due to memory and wall-time limits. So I performed a two-step shuffle: I split the examples in half (parts 1 and 2), shuffled each half, then randomly split the outputs from each of the first shuffles into two halves (parts 3 and 4) and ran a second round of shuffling. I then edited the path in the pbtxt file to accommodate all the file names.
I assumed that the commented-out lines are not read, so I added some extra ones to keep track of the various shuffling steps. And FYI,
gives the correct set of shuffled files:
And for the tuning set (which was shuffled normally, using just one step):
and
gives the desired file list again:
Hope that's useful!
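For reference, each shuffling round described above would use DeepVariant's shuffle_tfrecords_beam.py, roughly like the sketch below (paths and dataset names are placeholders; the --input_pattern_list and --output_pattern_prefix flags match those visible in the pbtxt comments quoted later in this thread):

# Round 1 of the two-step shuffle: shuffle one half of the examples (sketch).
python3 shuffle_tfrecords_beam.py \
  --input_pattern_list="examples/train_pt1-?????-of-?????.tfrecord.gz" \
  --output_pattern_prefix="shuffle_1_outputs/train_pt1.shuffled" \
  --output_dataset_config_pbtxt="shuffle_1_outputs/train_pt1.dataset_config.pbtxt" \
  --output_dataset_name="train_pt1" \
  --runner=DirectRunner
# Round 2: re-split the round-1 outputs into parts 3 and 4 and repeat.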
I don't think shuffling is the issue in your training set. You can see the class distribution for your training data in the pbtxt comments:

#name: "Shuffle_global"
#tfrecord_path: "/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-?????-of-?????.tfrecord.gz"
#num_examples: 727188
# class0: 727188
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_inputs/All_samples_all_training_examples_inc_downsampled_05_pt4.shuffled-000*-of-00020.tfrecord.gz
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4
#

Meaning, all of the examples you created are of class 0. Can you check if your truth VCF has 1/1 and 0/1 variants?
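A quick way to check is to count the genotypes in the truth VCF directly with bcftools (a sketch; substitute your truth VCF path):

# Count how many of each genotype (0/1, 1/1, 0/0, ...) the truth VCF contains.
bcftools query -f '[%GT\n]' truth.vcf.gz | sort | uniq -c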
For example, on a training set on our end, the pbtxt file's output looks like this:

# Classes:
# class 0: 855059086
# class 1: 1443187583
# class 2: 947352461
# Indel or SNP:
# Indel: 1227997460
# SNP: 2017601670

(Here class 0 is hom-ref, class 1 is het, and class 2 is hom-alt, which is why a training set containing only class 0 yields no het or homalt evaluation statistics.)
Hi @kishwarshafin, ah OK, that explains it. Yes, I definitely have variants in my VCF: between 850k and 1.15M for each of 5 samples. Here is the output of
Perhaps it is my make_examples command? I made examples for each sample separately using the below command.
And here's an excerpt from this sample's VCF:
And I have attached the log file from this make_examples run for this sample: MAKE_EX_TRAIN_3148249-3.err.gz
Thanks, Dan
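For context, a training-mode make_examples invocation generally looks like the sketch below (a generic example with placeholder paths, not the exact command used here; flags per the DeepVariant training docs, with one of 20 parallel tasks shown):

# Sketch of one training-mode make_examples task (placeholder paths).
/opt/deepvariant/bin/make_examples \
  --mode training \
  --ref reference.fa \
  --reads sample.bam \
  --truth_variants truth.vcf.gz \
  --confident_regions confident_regions.bed \
  --examples training_examples.tfrecord@20.gz \
  --task 0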
@DanJeffries, sorry for the late reply; I was traveling for a conference. Looking at the log, most of it looks like this:

I0812 17:25:36.970339 140640080795456 make_examples_core.py:301] Task 18/20: 130000 candidates (5465 examples) [24.32s elapsed]
I0812 17:25:38.358597 140641138153280 make_examples_core.py:301] Task 17/20: 136011 candidates (5980 examples) [24.30s elapsed]
I0812 17:25:38.452311 140719761319744 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:39.808730 139930050549568 make_examples_core.py:301] Task 19/20: 130009 candidates (5415 examples) [19.65s elapsed]
I0812 17:25:40.161706 140719761319744 make_examples_core.py:301] Task 1/20: 130009 candidates (5656 examples) [26.55s elapsed]
I0812 17:25:39.762312 140656897058624 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:40.178942 139929751582528 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:41.815151 139685487241024 make_examples_core.py:301] Task 8/20: 134010 candidates (5603 examples) [20.32s elapsed]
I0812 17:25:42.062644 140656897058624 make_examples_core.py:301] Task 14/20: 130007 candidates (5538 examples) [26.62s elapsed]
I0812 17:25:42.944558 139916975953728 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:42.945389 139916975953728 haplotype_labeler.py:449] Not including more because genotype_options_product will be 944784.0, which exceeds max(=100000)
I0812 17:25:42.940323 140366879631168 make_examples_core.py:301] Task 6/20: 132005 candidates (5726 examples) [25.32s elapsed]
I0812 17:25:43.504171 139929751582528 make_examples_core.py:301] Task 4/20: 132003 candidates (5481 examples) [23.80s elapsed]
I0812 17:25:43.563577 140585679882048 haplotype_labeler.py:449] Not including more because genotype_options_product will be 131220.0, which exceeds max(=100000)
I0812 17:25:45.208675 140012542068544 make_examples_core.py:301] Task 12/20: 134003 candidates (5624 examples) [21.31s elapsed]
I0812 17:25:45.249171 140585679882048 haplotype_labeler.py:449] Not including more because genotype_options_product will be 275562.0, which exceeds max(=100000)
I0812 17:25:44.861580 140290187421504 make_examples_core.py:301] Task 0/20: 134001 candidates (5695 examples) [22.80s elapsed]
I0812 17:25:44.957572 140719761319744 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:44.749801 140624495118144 make_examples_core.py:301] Task 9/20: 132004 candidates (5675 examples) [26.91s elapsed]
I0812 17:25:45.658525 139725838563136 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:48.367301 140203250202432 make_examples_core.py:301] Task 3/20: 130018 candidates (5401 examples) [24.78s elapsed]
I0812 17:25:49.316477 140120687957824 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:50.556129 140585679882048 make_examples_core.py:301] Task 16/20: 132012 candidates (5581 examples) [21.74s elapsed]
I0812 17:25:51.780121 140641138153280 make_examples_core.py:301] Task 17/20: 138019 candidates (6057 examples) [13.42s elapsed]

It really looks like you have a lot of variants; is this expected for your sample? What's happening here is that the haplotype_labeler is trying to label the candidates but failing because there are far too many combinations, so it gives up. What truth set are you using for this sample?
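The numbers in those warnings appear to be products of per-candidate genotype options within a group of nearby candidates, so the cap is hit quickly. For example, with three genotype options per candidate (a quick check, assuming the max(=100000) shown in the log):

echo $((3**10))   # 59049  -> under the 100000 cap; the group can still be labeled
echo $((3**11))   # 177147 -> over the cap; the labeler stops including candidates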
Hi @kishwarshafin, no worries, and thanks for finding the time to get back to me! Thanks also for the explanation. I had read that "Not including more" line in the log as benign, since it implies some candidates are still being included.
So, the truth data are generated from the offspring of 5 trios. Variants for the parents and offspring were called against a reference (using GATK4), Mendelian expectations were checked for each locus, and the loci that passed that check, as well as some hard filters (e.g. depth, GQ, etc.), were kept for the truth set. The variants in the truth set are the combination of all the variants that passed these filters in all 5 trios, which amounts to around 450,000 variants (split into ~350k for training and ~100k for tuning). The reference is from a different population, which will probably result in more hom_alt SNPs against the ref, but other than that, I don't think this number of SNPs is particularly high for a 400Mb genome of a wild fish with relatively large population sizes. The 0.5x downsampling of course doubles this number, resulting in ~900,000 truth variants in total.
So, could a solution be to increase the maximum for genotype_options_product? Or would you suggest subsampling the truth variants? Thanks! Dan
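As an aside, one way to run the kind of Mendelian-consistency check described above is bcftools' mendelian plugin (a sketch; the plugin name and the expected trio sample order vary between bcftools versions, so check bcftools +mendelian --help first):

# Count Mendelian-inconsistent sites in one trio (sketch; sample order may differ).
bcftools +mendelian trio_calls.vcf.gz -t mother,father,child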
Hi @DanJeffries, to start with, can you add this to your make_examples command and see if it helps:

--labeler_algorithm=positional_labeler \

This should switch the labeling from haplotype-based to positional, and it should resolve this issue. Let me know if this works.
Hi @kishwarshafin, I made the change you suggested. It didn't fix the issue; however, you did help point me to the real cause. It turns out that when I created the confident-regions BED file, I failed to include the positions of the variants (only confident hom_ref positions). Having gone back and added these positions, the examples now contain all 3 classes of site.
One last question: given that the labeler was not the issue, should I switch back to using the haplotype-aware labeler? I read somewhere that this is the better approach, though I am not sure why. Once I have heard from you on this last point, I'll close the issue. Thanks, Dan
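For anyone hitting the same problem, the fix amounts to taking the union of the confident hom-ref intervals and the truth variant sites, roughly like this (a sketch with placeholder file names):

# Add truth variant positions to the confident hom-ref intervals (sketch).
bcftools query -f '%CHROM\t%POS0\t%END\n' truth.vcf.gz > truth_sites.bed
cat homref_confident.bed truth_sites.bed \
  | sort -k1,1 -k2,2n \
  | bedtools merge -i stdin > confident_regions.bed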
Glad you figured it out! So, sorry for the vague answer, but there's no single correct answer here. Hope your training goes well.
Hi @kishwarshafin, OK, thanks for the explanation. Unfortunately, I am still having trouble. For some reason, when I use positional_labeler my make_examples jobs fail, but they complete successfully using haplotype_labeler. I can't figure out what the issue is, as the error message isn't particularly informative, at least not to me.
I attach two log files from the make_examples step. Both are for the same sample; the only difference is that one uses haplotype_labeler (job succeeded) and the other uses positional_labeler (job failed). It would be great to get your opinion on what is going on. Note that I have tested the positional labeler a few times and it does seem to work for one sample, but there is no reason this sample should be distinct from the others.
haplotype_labeler:
positional_labeler:
Thanks, Dan
So it looks like the VCF parsing is failing here:

I0911 20:23:50.839811 140713511122752 positional_labeler.py:163] Multiple matches detected; no good match found. Fall back to first. variant: reference_bases: "G"
alternate_bases: "A"
info {
key: "BAM_FNAME"
value {
values {
string_value: "SR_male_1.fixmate.coordsorted.bam"
}
}
}
calls {
info {
key: "AD"
value {
values {
int_value: 0
}
values {
int_value: 22
}
}
}
info {
key: "DP"
value {
values {
int_value: 22
}
}
}
info {
key: "VAF"
value {
values {
number_value: 1.0
}
}
}
genotype: -1
genotype: -1
call_set_name: "SR_male_1"
}
end: 14057936
reference_name: "NC_053213.1_chromosome_2"
start: 14057935
: matches: [reference_bases: "G"
alternate_bases: "A"
alternate_bases: "<NON_REF>" Looks like your truth contains alleles like "<NON_REF>" and when positional lableler tries to match something it fails. Is your truth VCF consistent with the regular VCF standards? You may need to filter non-standard calls like "NON_REF" as an alt allele from the truth to fix this. |
Hi @kishwarshafin, yeah, these are standard VCFs, output by GenotypeGVCFs (GATK 4.1.3) and filtered with BCFtools. Having done some more tests, I don't think the existence of <NON_REF> alleles is the issue, for several reasons:
1) These alleles occur many times in the VCF well before the job fails.
2) The job often fails on loci that do not have such alleles.
3) After removing them (and all ALT alleles not found in the sample in each VCF), the jobs still fail.
I cannot see any pattern in the loci reported in the log file at the point at which the job fails. And still, some jobs succeed. Anyway, I will keep digging, and I'll close this issue since the original question was solved. Thanks for the help.
Hello DV team, and thanks for creating such a great tool!
I am currently trying to retrain the WGS model for a new species (a fish); however, during training, I see no evaluation statistics (precision, recall, F1) for either het or homalt calls. More specifically, they are all 0.0. Eval stats are reported for homref, though. I have now tried running the training several times with different hyperparameters, but so far there is still no change in the het or homalt eval stats.
My first, very simple question is thus: are these eval stats truly 0 (i.e., is the model very bad), or is 0.0 some starting value used when there are not yet enough data to calculate them? I am warm-starting from the 1.6.1 WGS model, so I can't imagine the model is really that bad at calling variants initially, even in a fish.
Setup
Running on a university computing cluster (https://hpc-unibe-ch.github.io/)
OS: Rocky 9.3 Blue Onyx
GPU: RTX 4090
Installation: running from the Docker image via Singularity
DV version: 1.6.1
Data
I am training on examples from 5 individuals; the data are Illumina NovaSeq reads at ~20x coverage.
17/21 chromosomes used for training (~1.45M examples)
2/21 chromosomes used for tuning (~200k examples)
2/21 chromosomes reserved for testing.
(Different chromosomes used for train/tune/test across samples - see below)
Shuffling
Performed downsampling at 0.5 (downsampled examples are included alongside the full-coverage ones).
Shuffled globally across samples, chromosomes and downsampling.
Command
My latest training run was like so:
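A run of this shape would follow the DeepVariant 1.6 training case study, roughly as in this sketch (placeholder paths and illustrative hyperparameter values; the --config.* flag names are taken from that case study, so double-check them against your version):

# Sketch of a DeepVariant 1.6-style training run (placeholders throughout).
docker run --gpus 1 google/deepvariant:1.6.1-gpu \
  train \
  --config=dv_config.py:base \
  --config.train_dataset_pbtxt=train/All_samples_training_examples.dataset_config.pbtxt \
  --config.tune_dataset_pbtxt=tune/All_samples_tune_examples.dataset_config.pbtxt \
  --config.init_checkpoint=wgs_v1.6.1/checkpoint \
  --config.batch_size=64 \
  --config.learning_rate=0.001 \
  --config.num_epochs=1 \
  --experiment_dir=training_runs/run1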
Previous runs used a higher learning rate (0.01) and batch size (128). Training proceeds as follows:
Training Examples: 1454377
Batch Size: 64
Epochs: 1
Steps per epoch: 22724
Steps per tune: 3162
Num train steps: 22724
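Those step counts are consistent with the example counts, assuming steps = examples / batch size (integer division):

echo $((1454377 / 64))   # 22724  -> matches the reported steps per epoch
echo $((3162 * 64))      # 202368 -> matches the ~200k tuning examples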
Log file
Here is the top of the log file, including some warnings in case they are relevant:
And here is an excerpt from a later portion of the log file, including some training and tuning steps, where you can see the 0.0 values for the het and homalt eval stats.
I am new to deep learning and am struggling to decide whether something is wrong with my training approach/scripts or whether the model just needs more time or different hyperparameters. Given the number of examples, I can only run 1 epoch at a time before I hit the 24-hour cluster wall-time limit, so I have only trained for around 30,000 steps in total across 2 epochs so far (restarting from the last checkpoint after the 1st epoch).
All advice much appreciated!