Create tools for SV VCF cleaning #8996

Draft

kjaisingh wants to merge 62 commits into master from kj_sv_cleanvcf

Commits (62)
d8fb88c
Initial commit
kjaisingh Oct 9, 2024
0c72bd2
Silenced SV type processing
kjaisingh Oct 11, 2024
05501ce
Created initial commit for 1b
kjaisingh Oct 16, 2024
40295e1
Reformatting per GATK style guide
kjaisingh Oct 16, 2024
5cd8ea3
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
75f476b
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
8d31a61
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
12cc930
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
4ecd525
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
df375cc
PR feedback
kjaisingh Oct 17, 2024
6f34672
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Oct 17, 2024
2e28e13
WIP commit - prior to deprecating BED file input in 1b
kjaisingh Oct 17, 2024
f0c0e0f
Updated to no longer ingest BED file
kjaisingh Oct 18, 2024
50c4eaf
Cleaned up scripts...
kjaisingh Oct 18, 2024
74de1b0
SVCleanPt2 WIP
kjaisingh Oct 18, 2024
354857f
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Oct 18, 2024
74f0d73
Working version of SVCleanPt2
kjaisingh Oct 22, 2024
54f5f96
Code cleanup
kjaisingh Oct 22, 2024
18da350
Added sorting and better formatting of outputs
kjaisingh Oct 22, 2024
f7e14c6
Initial commit for CleanPt4
kjaisingh Oct 23, 2024
5f5a6cd
WIP
kjaisingh Oct 24, 2024
52a9049
WIP - up till CleanVcf5 (first task complete)
kjaisingh Oct 25, 2024
31f7032
Reformatting & restructuring
kjaisingh Oct 26, 2024
2b97b7b
Completed CleanVcf4 / implemented skeleton walker for CleanVcf5
kjaisingh Oct 28, 2024
45443f9
Revert SVCleanPt2 to use overlap buffer
kjaisingh Oct 30, 2024
7fbc3a2
Working implementation of SVCleanPt5
kjaisingh Oct 30, 2024
181b352
Modified param name for chrX/chrY
kjaisingh Oct 31, 2024
3eb5c3d
Changes to test
kjaisingh Oct 31, 2024
52daa21
Changes to test
kjaisingh Oct 31, 2024
d438bb9
CleanVcf4 added exit if not in range
kjaisingh Oct 31, 2024
9aaf2f1
Updated type of EV format field
kjaisingh Oct 31, 2024
1de3e2f
Minor changes - replaced EV type
kjaisingh Oct 31, 2024
4e353a0
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Nov 1, 2024
a57601f
Skip no-call genotypes
kjaisingh Nov 1, 2024
2f17eec
Changes post-debugging: modify .getEnd() to use SVLEN
kjaisingh Nov 4, 2024
4c60608
Undo use of getEnd()
kjaisingh Nov 5, 2024
56ecaeb
Further debugging: modified SVTYPE update & corresponding genotype as…
kjaisingh Nov 5, 2024
82ecea8
Handled no-call rewriting bug
kjaisingh Nov 5, 2024
d96d810
Modifications to large del/dup event logic
kjaisingh Nov 6, 2024
0fa7738
Modified conditions for biallelic filtering
kjaisingh Nov 6, 2024
a281d65
Modified filter to only use distinct pairs
kjaisingh Nov 6, 2024
f7215ee
Moved svtype modification to cleanpt4
kjaisingh Nov 7, 2024
383a01c
Reverted svtype changes
kjaisingh Nov 7, 2024
c9d1020
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Nov 12, 2024
ee16037
Overlap logic change to align with GATK-SV outputs
kjaisingh Nov 12, 2024
3f901f4
Modified overlap logic to only be unidirectional
kjaisingh Nov 12, 2024
e8ce4c5
Standardized variant overlap code
kjaisingh Nov 12, 2024
29dc96f
Modified overlap logic - minor bug
kjaisingh Nov 12, 2024
a68dcc6
Modified CleanPt4 imports
kjaisingh Nov 14, 2024
8ba660e
Minor change to remove redundant size/svtype check
kjaisingh Nov 18, 2024
657e601
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Nov 22, 2024
707ada6
Minor file creation
kjaisingh Nov 25, 2024
1e30b4b
New tools - modular implementations
kjaisingh Dec 11, 2024
726144d
Updated to include necessary imports only
kjaisingh Dec 11, 2024
0ab56eb
Updated header writing
kjaisingh Dec 13, 2024
8fbd27e
Added sex revisions for male GT
kjaisingh Jan 13, 2025
a40b193
Concatenated overlapping cnv tasks into one
kjaisingh Jan 17, 2025
63838fe
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Jan 17, 2025
e60fc9c
Bug-fixed to match results from 5-pass version
kjaisingh Jan 21, 2025
a4db400
Standardized overlap logic across OverlappingCnv task methods
kjaisingh Jan 21, 2025
7f9b22d
Used caching to improve runtime
kjaisingh Jan 29, 2025
f46e501
Minor structural changes to overlap code
kjaisingh Feb 6, 2025
src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/GATKSVVCFConstants.java

@@ -3,6 +3,10 @@
import com.google.common.collect.HashBiMap;
import htsjdk.variant.variantcontext.Allele;

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.HashSet;
import java.util.Map;

import static java.util.Map.entry;
@@ -161,6 +165,39 @@ public enum ComplexVariantSubtype {
public static final String LOW_QS_SCORE_FILTER_KEY = "LOW_QS";
public static final String FREQUENCY_FILTER_KEY = "FREQ";

// CleanVcf
public static final String ME = "ME";
public static final String VAR_GQ = "varGQ";
public static final String MULTIALLELIC = "MULTIALLELIC";
public static final String UNRESOLVED = "UNRESOLVED";
public static final String HIGH_SR_BACKGROUND = "HIGH_SR_BACKGROUND";
public static final String BOTHSIDES_SUPPORT = "BOTHSIDES_SUPPORT";
public static final String PESR_GT_OVERDISPERSION = "PESR_GT_OVERDISPERSION";
public static final String REVISED_EVENT = "REVISED_EVENT";
public static final String MULTI_CNV = "MULTI_CNV";

public static final String RD_CN = "RD_CN";
public static final String RD_GQ = "RD_GQ";
public static final String PE_GT = "PE_GT";
public static final String SR_GT = "SR_GT";
public static final String PE_GQ = "PE_GQ";
public static final String SR_GQ = "SR_GQ";
public static final String CNV = "CNV";
public static final String UNR = "UNR";
public static final String EVENT = "EVENT";
public static final String EV = "EV";
public static final List<String> EV_VALUES = Arrays.asList(
null, "RD", "PE", "RD,PE", "SR", "RD,SR", "PE,SR", "RD,PE,SR"
);
public static final Set<String> FILTER_VCF_LINES = new HashSet<>(Arrays.asList(
"CIPOS", "CIEND", "RMSSTD", "source", "bcftools", "GATKCommandLine", "#CHROM"
));

public static final Set<String> FILTER_VCF_INFO_LINES = new HashSet<>(Arrays.asList(
GATKSVVCFConstants.UNRESOLVED, GATKSVVCFConstants.MULTIALLELIC, GATKSVVCFConstants.VAR_GQ,
GATKSVVCFConstants.MULTI_CNV, GATKSVVCFConstants.REVISED_EVENT, GATKSVVCFConstants.EVENT
));

// Clustering
public static final String CLUSTER_MEMBER_IDS_KEY = "MEMBERS";

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVCleanPt1a.java (new file)

@@ -0,0 +1,337 @@
package org.broadinstitute.hellbender.tools.walkers.sv;

import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.Genotype;
import htsjdk.variant.variantcontext.GenotypeBuilder;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.variantcontext.VariantContextBuilder;
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
import htsjdk.variant.vcf.VCFHeader;
import htsjdk.variant.vcf.VCFInfoHeaderLine;
import htsjdk.variant.vcf.VCFFilterHeaderLine;
import htsjdk.variant.vcf.VCFHeaderLineType;

import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.BetaFeature;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.cmdline.programgroups.StructuralVariantDiscoveryProgramGroup;
import org.broadinstitute.hellbender.engine.*;
import org.broadinstitute.hellbender.tools.spark.sv.utils.GATKSVVCFConstants;
import org.broadinstitute.hellbender.utils.tsv.TableUtils;
import org.broadinstitute.hellbender.utils.tsv.TableReader;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Path;

import java.util.Arrays;
import java.util.List;
import java.util.ArrayList;
import java.util.Set;
import java.util.stream.Collectors;

/**
* Completes an initial series of cleaning steps for a VCF produced by the GATK-SV pipeline.
*
* <h3>Inputs</h3>
* <ul>
* <li>
* VCF containing structural variant (SV) records from the GATK-SV pipeline.
* </li>
* <li>
* TODO
* </li>
* </ul>
*
* <h3>Output</h3>
* <ul>
* <li>
* Cleansed VCF.
* </li>
* </ul>
*
* <h3>Usage Example</h3>
* <pre>
* TODO
* </pre>
*
* <h3>Processing Steps</h3>
* <ol>
* <li>
* TODO
* </li>
* </ol>
*/
@CommandLineProgramProperties(
summary = "Clean and format SV VCF",
oneLineSummary = "Clean and format SV VCF",
programGroup = StructuralVariantDiscoveryProgramGroup.class
)
@BetaFeature
@DocumentedFeature
public final class SVCleanPt1a extends VariantWalker {
public static final String CHRX_LONG_NAME = "chr-X";
public static final String CHRY_LONG_NAME = "chr-Y";
public static final String FAIL_LIST_LONG_NAME = "fail-list";
public static final String PASS_LIST_LONG_NAME = "pass-list";
public static final String OUTPUT_SAMPLES_LIST_LONG_NAME = "output-samples-list";

@Argument(
fullName = CHRX_LONG_NAME,
doc = "chrX column name",
optional = true
)
private final String chrX = "chrX";

@Argument(
fullName = CHRY_LONG_NAME,
doc = "chrY column name",
optional = true
)
private final String chrY = "chrY";

@Argument(
fullName = FAIL_LIST_LONG_NAME,
doc = "File with variants failing the background test"
)
private GATKPath failList;

@Argument(
fullName = PASS_LIST_LONG_NAME,
doc = "File with variants passing both sides"
)
private GATKPath passList;

@Argument(
fullName = OUTPUT_SAMPLES_LIST_LONG_NAME,
doc = "Output file with samples"
)
private GATKPath outputSamplesList;

@Argument(
fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME,
shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
doc = "Output VCF name"
)
private GATKPath outputVcf;

private VariantContextWriter vcfWriter;
private BufferedWriter samplesWriter = null;

private Set<String> failSet;
private Set<String> passSet;

private static final int MIN_ALLOSOME_EVENT_SIZE = 5000;

@Override
public void onTraversalStart() {
// Read supporting files
failSet = readLastColumn(failList);
passSet = readLastColumn(passList);
Comment on lines +132 to +133

Contributor:
You could use TableReader here. It would require you to add header lines to the lists in the WDL, which would be okay too. See TableUtils.reader() and look at some implementations to see examples.

Author:
Updated accordingly - thanks for the tip. For visibility, I have made corresponding changes to GATK-SV in broadinstitute/gatk-sv#733.
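
For readers following the thread, a minimal sketch of the TableUtils.reader() pattern being suggested; the helper name, the headered tab-delimited layout, and the "variant ID in the last column" convention are assumptions mirrored from the readLastColumn() method later in this diff, not part of the PR itself:

```java
// Sketch only: method name and file layout are assumed, mirroring readLastColumn() below.
private static Set<String> readIdsFromLastColumn(final Path path) throws IOException {
    try (final TableReader<String> reader = TableUtils.reader(path,
            (columns, exceptionFactory) -> dataline -> dataline.get(columns.columnCount() - 1))) {
        // Each data line maps to its last column; TableReader consumes the header line.
        return reader.stream().collect(Collectors.toSet());
    }
}
```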


// Add new header lines
VCFHeader header = getHeaderForVariants();
header.addMetaDataLine(new VCFFilterHeaderLine(GATKSVVCFConstants.UNRESOLVED, "Variant is unresolved"));
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.HIGH_SR_BACKGROUND, 0, VCFHeaderLineType.Flag, "High number of SR splits in background samples indicating messy region"));
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.BOTHSIDES_SUPPORT, 0, VCFHeaderLineType.Flag, "Variant has read-level support for both sides of breakpoint"));
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.REVISED_EVENT, 0, VCFHeaderLineType.Flag, "Variant has been revised due to a copy number mismatch"));

// Write header
vcfWriter = createVCFWriter(outputVcf);
vcfWriter.writeHeader(header);

// Write samples list
try {
samplesWriter = new BufferedWriter(new FileWriter(outputSamplesList.toPath().toFile()));
for (String sample : header.getGenotypeSamples()) {
samplesWriter.write(sample);
samplesWriter.newLine();
}
samplesWriter.flush();
} catch (IOException e) {
throw new RuntimeException("Can't create output file", e);
}
}

@Override
public void closeTool() {
if (vcfWriter != null) {
vcfWriter.close();
}
}

@Override
public void apply(final VariantContext variant, final ReadsContext readsContext, final ReferenceContext referenceContext, final FeatureContext featureContext) {
// Create core data
VariantContextBuilder builder = new VariantContextBuilder(variant);
List<Genotype> genotypes = variant.getGenotypes();

// Process variant
genotypes = processEV(genotypes);
processVarGQ(variant, builder);
processMultiallelic(variant, builder);
processUnresolved(variant, builder);
processNoisyEvents(variant, builder);
processBothsidesSupportEvents(variant, builder);
genotypes = processAllosomes(variant, builder, genotypes);

builder.genotypes(genotypes);
vcfWriter.add(builder.make());
}

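// Decodes the integer-encoded EV FORMAT value into its evidence-type string (e.g. 7 -> "RD,PE,SR") via EV_VALUES.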
private List<Genotype> processEV(final List<Genotype> genotypes) {
List<Genotype> updatedGenotypes = new ArrayList<>(genotypes.size());
for (Genotype genotype : genotypes) {
if (genotype.hasExtendedAttribute(GATKSVVCFConstants.EV)) {
GenotypeBuilder gb = new GenotypeBuilder(genotype);
String evAttribute = (String) genotype.getExtendedAttribute(GATKSVVCFConstants.EV);
final int evIndex = Integer.parseInt(evAttribute);
if (evIndex >= 0 && evIndex < GATKSVVCFConstants.EV_VALUES.size()) {
gb.attribute(GATKSVVCFConstants.EV, GATKSVVCFConstants.EV_VALUES.get(evIndex));
}
updatedGenotypes.add(gb.make());
} else {
updatedGenotypes.add(genotype);
}
}
return updatedGenotypes;
}

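// Moves the Phred-scaled varGQ INFO value into QUAL (log10PError = varGQ / -10) and removes the INFO field.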
private void processVarGQ(final VariantContext variant, final VariantContextBuilder builder) {
if (variant.hasAttribute(GATKSVVCFConstants.VAR_GQ)) {
final double varGQ = variant.getAttributeAsDouble(GATKSVVCFConstants.VAR_GQ, 0);
builder.rmAttribute(GATKSVVCFConstants.VAR_GQ);
builder.log10PError(varGQ / -10.0);
}
}

private void processMultiallelic(final VariantContext variant, final VariantContextBuilder builder) {
if (variant.hasAttribute(GATKSVVCFConstants.MULTIALLELIC)) {
builder.rmAttribute(GATKSVVCFConstants.MULTIALLELIC);
}
}

private void processUnresolved(final VariantContext variant, final VariantContextBuilder builder) {
if (variant.hasAttribute(GATKSVVCFConstants.UNRESOLVED)) {
builder.rmAttribute(GATKSVVCFConstants.UNRESOLVED);
builder.filter(GATKSVVCFConstants.UNRESOLVED);
}
}

private void processNoisyEvents(final VariantContext variant, final VariantContextBuilder builder) {
if (failSet.contains(variant.getID())) {
builder.attribute(GATKSVVCFConstants.HIGH_SR_BACKGROUND, true);
}
}

private void processBothsidesSupportEvents(final VariantContext variant, final VariantContextBuilder builder) {
if (passSet.contains(variant.getID())) {
builder.attribute(GATKSVVCFConstants.BOTHSIDES_SUPPORT, true);
}
}

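// Revises genotypes of large (>= 5 kb) DEL/DUP events on chrX/chrY based on the sample's expected copy number (1: male, 2: female, 0: unknown).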
private List<Genotype> processAllosomes(final VariantContext variant, final VariantContextBuilder builder, final List<Genotype> genotypes) {
final String chromosome = variant.getContig();
if (!chromosome.equals(chrX) && !chromosome.equals(chrY)) {
return genotypes;
}

List<Genotype> updatedGenotypes = new ArrayList<>(genotypes.size());
for (Genotype genotype : genotypes) {
final String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, "");
if ((svType.equals(GATKSVVCFConstants.SYMB_ALT_STRING_DEL) || svType.equals(GATKSVVCFConstants.SYMB_ALT_STRING_DUP)) &&
(variant.getAttributeAsInt(GATKSVVCFConstants.SVLEN, 0) >= MIN_ALLOSOME_EVENT_SIZE)) {
GenotypeBuilder gb = new GenotypeBuilder(genotype);
final boolean isY = chromosome.equals(chrY);
final int sex = (int) genotype.getExtendedAttribute(GATKSVVCFConstants.EXPECTED_COPY_NUMBER_FORMAT);
if (sex == 1 && isRevisableEvent(variant, isY, sex)) { // Male
builder.attribute(GATKSVVCFConstants.REVISED_EVENT, true);
adjustMaleGenotype(genotype, gb, svType);
} else if (sex == 2 && isY) { // Female
gb.alleles(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL));
} else if (sex == 0) { // Unknown
gb.alleles(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL));
}
updatedGenotypes.add(gb.make());
}
else {
updatedGenotypes.add(genotype);
}
}
return updatedGenotypes;
}

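// Shifts a male allosome call toward a diploid representation: RD_CN is incremented by one and the GT alleles are reassigned from the original RD_CN and SV type.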
private void adjustMaleGenotype(final Genotype genotype, final GenotypeBuilder gb, final String svType) {
if (genotype.hasExtendedAttribute(GATKSVVCFConstants.RD_CN)) {
final int rdCN = Integer.parseInt(genotype.getExtendedAttribute(GATKSVVCFConstants.RD_CN).toString());
gb.attribute(GATKSVVCFConstants.RD_CN, rdCN + 1);

final Allele refAllele = genotype.getAllele(0);
final Allele altAllele = genotype.getAllele(1);
if (svType.equals(GATKSVVCFConstants.SYMB_ALT_STRING_DEL)) {
if (rdCN >= 1) gb.alleles(Arrays.asList(refAllele, refAllele));
else if (rdCN == 0) gb.alleles(Arrays.asList(refAllele, altAllele));
} else if (svType.equals(GATKSVVCFConstants.SYMB_ALT_STRING_DUP)) {
if (rdCN <= 1) gb.alleles(Arrays.asList(refAllele, refAllele));
else if (rdCN == 2) gb.alleles(Arrays.asList(refAllele, altAllele));
else gb.alleles(Arrays.asList(altAllele, altAllele));
}
}
}

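// An event is revisable when the doubled-median RD_CN is 2 for males (median 1) and, for females, 0 on chrY or 4 (median 2) on chrX.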
private boolean isRevisableEvent(final VariantContext variant, final boolean isY, final int sex) {
final List<Genotype> genotypes = variant.getGenotypes();
final int[] maleCounts = new int[4];
final int[] femaleCounts = new int[4];
for (final Genotype genotype : genotypes) {
final int rdCN = (int) genotype.getExtendedAttribute(GATKSVVCFConstants.RD_CN, -1);
final int rdCNVal = Math.min(rdCN, 3);
if (rdCNVal == -1) continue;

if (sex == 1) {
maleCounts[rdCNVal]++;
} else if (sex == 2) {
femaleCounts[rdCNVal]++;
}
}

final int maleMedian = calcMedianDistribution(maleCounts);
final int femaleMedian = calcMedianDistribution(femaleCounts);
return maleMedian == 2 && (isY ? femaleMedian == 0 : femaleMedian == 4);
}

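// Returns twice the median of the RD_CN count distribution (doubling keeps half-integer medians integral), or -1 if there are no counts.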
private int calcMedianDistribution(final int[] counts) {
final int total = Arrays.stream(counts).sum();
if (total == 0) return -1;

final int target = total / 2;
int runningTotal = 0;
for (int i = 0; i < 4; i++) {
runningTotal += counts[i];
if (runningTotal == target) {
return i * 2 + 1;
} else if (runningTotal > target) {
return i * 2;
}
}
throw new RuntimeException("Error calculating median");
}

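// Reads a headered, tab-delimited list with TableUtils.reader and collects the last column (variant IDs) into a set.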
private Set<String> readLastColumn(final GATKPath filePath) {
try {
final Path path = filePath.toPath();
final TableReader<String> reader = TableUtils.reader(path, (columns, exceptionFactory) ->
(dataline) -> dataline.get(columns.columnCount() - 1)
);

Set<String> result = reader.stream().collect(Collectors.toSet());
reader.close();
return result;
} catch (IOException e) {
throw new RuntimeException("Error reading variant list file: " + filePath, e);
}
}
}