Release v0.9.0
Highlights
In this release, we offer a new schema for output BigQuery tables. The new schema uses BigQuery's integer range partitioning, which significantly reduces query costs. We also allow users to store BigQuery tables that are highly optimized for sample lookup queries, such as:
Find all variants of Patient X
Note that this release contains backwards-incompatible changes. Please see the details below.
New Features / Improvements
- By default, one BigQuery table per chromosome is created; each table is integer range partitioned.
  - Output tables have suffixes such as __chr1, __chr2, …
  - Output tables can be changed by modifying the sharding config file.
- call.name is replaced with call.sample_id, where sample_id is the hash of the sample name.
  - In cases where multiple VCF files have the same name, the file path can be included in the hash value to distinguish between samples.
- An extra BQ table with the __sample_info suffix is created. This table contains the mapping from sample_id to sample_name and vcf_file_path.
  - We also include an ingestion_datetime column in the sample info table to record the ingestion datetime of each VCF file.
- 1-based coordinate is used by default for genomic indexing to make BigQuery tables more compatible with VCF files.
- If --append is set, we ensure all expected output tables already exist before we append to them.
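As an illustration of how the per-chromosome tables and the __sample_info table fit together, the following is a minimal sketch of a "find all variants of Patient X" query. The base table name, the helper name build_sample_lookup_query, and the assumption that call is a repeated record that can be unnested are illustrative only; the release notes confirm only the __chr1-style suffixes, the __sample_info suffix, and the sample_id and sample_name columns.

```python
def build_sample_lookup_query(base_table: str, shard_suffix: str = "chr1") -> str:
    """Build a BigQuery Standard SQL query for one sharded variants table.

    base_table is a hypothetical fully qualified name such as
    "my_project.my_dataset.variants". The sample name is left as the
    named query parameter @sample_name rather than being inlined.
    """
    # The release reserves "__" as the separator between the table base
    # name and its suffix, so the base name itself must not contain it.
    if "__" in base_table.split(".")[-1]:
        raise ValueError("table base name must not contain '__'")

    variants_table = f"{base_table}__{shard_suffix}"
    sample_info_table = f"{base_table}__sample_info"

    # Assumption: `call` is a repeated record with a sample_id field.
    return (
        "SELECT v.*\n"
        f"FROM `{variants_table}` AS v, UNNEST(v.call) AS c\n"
        "WHERE c.sample_id IN (\n"
        f"  SELECT sample_id FROM `{sample_info_table}`\n"
        "  WHERE sample_name = @sample_name\n"
        ")"
    )


if __name__ == "__main__":
    print(build_sample_lookup_query("my_project.my_dataset.variants", "chr2"))
```

The query text can then be submitted with any BigQuery client, binding @sample_name to the sample of interest as a string query parameter.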
New flags
- vcf_to_bq:
  - --sample_lookup_optimized_output_table: to store a second copy of variants in BigQuery tables that are optimized for sample lookup queries. This feature is particularly useful when the input VCF file contains joint genotyped samples.
  - --keep_intermediate_avro_files: to store intermediate Avro files in your temp directory on GCS bucket.
  - --use_1_based_coordinate: By default start position will be 1-based, and end position will be inclusive. You can set this flag to False to use 0-based coordinate.
  - --sample_name_encoding: determines the way sample_id is hashed. Default value is WITHOUT_FILE_PATH. If set to WITH_FILE_PATH, then sample_id will be a hash of [vcf_file_path, sample_name].
  - --sharding_config_path: replaces --partition_config_path.
- bq_to_vcf:
  - --bq_uses_1_based_coordinate: set to False if --use_1_based_coordinate was set to False when generating the BQ tables, and hence start positions are 0-based.
  - --sample_names: replaces --call_names.
  - --preserve_sample_order: replaces --preserve_call_names_order.
- docker run flags:
  - All the following flags are required:
    - --project
    - --regions
    - --temp_location
  - If you need to run Variant Transforms in a subnetwork using private IP addresses:
    - --subnetwork ${CUSTOM_SUBNETWORK}
    - --use_public_ips false
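To make the two --sample_name_encoding modes listed under vcf_to_bq more concrete, here is a minimal sketch of how a sample_id could be derived. It is not the Variant Transforms implementation: the SHA-256 hash, the 63-bit truncation, the separator byte, and the function name sketch_sample_id are all assumptions chosen for illustration. Only the mode names and the [vcf_file_path, sample_name] input come from the release notes.

```python
import hashlib
from typing import Optional


def sketch_sample_id(
    sample_name: str,
    vcf_file_path: Optional[str] = None,
    encoding: str = "WITHOUT_FILE_PATH",
) -> int:
    """Return an illustrative integer ID for a sample (hypothetical scheme)."""
    if encoding == "WITHOUT_FILE_PATH":
        parts = [sample_name]
    elif encoding == "WITH_FILE_PATH":
        if vcf_file_path is None:
            raise ValueError("vcf_file_path is required for WITH_FILE_PATH")
        parts = [vcf_file_path, sample_name]
    else:
        raise ValueError(f"unknown encoding: {encoding}")

    # Assumption: join the parts with a NUL byte and hash with SHA-256.
    digest = hashlib.sha256("\x00".join(parts).encode("utf-8")).digest()
    # Keep 63 bits so the value fits in a signed 64-bit integer column.
    return int.from_bytes(digest[:8], "big") >> 1


if __name__ == "__main__":
    a = sketch_sample_id("Sample1", "gs://bucket/a.vcf")
    b = sketch_sample_id("Sample1", "gs://bucket/b.vcf")
    print(a == b)  # same name, file path ignored by default
```

With the default mode, two files that both contain a sample called Sample1 map to the same ID; with WITH_FILE_PATH the file path takes part in the hash, so the two samples can be told apart.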
Deprecated flags
The following flags are deprecated and will be removed in the next release:
- --optimize_for_large_inputs: because sharding is done by default for all inputs.
- --num_bigquery_write_shards: because we are using Avro sink in the Dataflow pipeline.
- --output_avro_path: replaced with --keep_intermediate_avro_files.
- --reference_names: you can achieve the same goal by modifying the default sharding config file.
Underlying improvements
- Switched our default VCF parser from PyVCF to pysam.
- Updated to Beam 2.22.
- Launcher VM is changed to g1-small to reduce the overall cost of running VT.
Breaking Changes
- By default, 1-based coordinate is used for genomic indexing. We use the same default value for bq_to_vcf, so VCF -> BigQuery -> VCF with default flags should work.
- call.name is replaced with call.sample_id
- --partition_config_path is replaced with --sharding_config_path
- Sharding config YAML format has changed.
- Output table names cannot contain __ because we reserve this string for separating the table base name from the suffixes that we read from the sharding config file.
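Because the coordinate default is the change most likely to affect existing queries, the sketch below shows the arithmetic for moving between the two conventions. It assumes the 0-based form is half-open (start inclusive, end exclusive), which the release notes do not state explicitly; the function names are hypothetical.

```python
from typing import Tuple


def to_one_based(start0: int, end0: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval to 1-based, inclusive."""
    # The start shifts by one; the exclusive end equals the inclusive end.
    return start0 + 1, end0


def to_zero_based(start1: int, end1: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive interval to 0-based, half-open."""
    return start1 - 1, end1


if __name__ == "__main__":
    # A single-base variant at VCF POS 100.
    print(to_one_based(99, 100))
    print(to_zero_based(100, 100))
```

Queries written against tables from earlier releases that filter on start positions would need this one-base shift when run against tables created with the new default.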
The following flags have been removed in this release:
- --vcf_parser
- --partition_config_path: replaced with --sharding_config_path
- --call_names: replaced with --sample_names
- --preserve_call_names_order: replaced with --preserve_sample_order