Release v0.9.0

@samanvp released this 09 Jul 23:17
· 39 commits to master since this release
ee3b767

Highlights

In this release, we offer a new schema for output BigQuery tables. The new schema uses BigQuery's integer range partitioning, which significantly reduces query costs. Users can also store a second copy of the variants in BigQuery tables that are optimized for sample lookup queries, such as:

Find all variants of Patient X

Note that this release contains backward-incompatible changes. Please see the details below.

New Features / Improvements

  • By default one BigQuery table per chromosome is created; each table is integer range partitioned.
    • Output tables have suffixes such as __chr1, __chr2, …
    • Output tables can be changed by modifying the sharding config file.
  • call.name is replaced with call.sample_id, where sample_id is the hash of sample name.
    • In cases where multiple VCF files have the same name, file path can be included in the hash value to distinguish between samples.
  • An extra BQ table with the __sample_info suffix is created. This table maps each sample_id to its sample_name and vcf_file_path.
    • The sample info table also includes an ingestion_datetime column that records the ingestion datetime of each VCF file.
  • 1-based coordinates are used by default for genomic indexing to make BigQuery tables more compatible with VCF files.
  • If --append is set, we ensure all expected output tables already exist before we append to them.
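
To illustrate how including the file path in the hash distinguishes samples that share a name, here is a minimal sketch. The release notes do not say which hash function Variant Transforms uses, so SHA-256 truncated to a 63-bit integer is purely an assumption, as are the function name `sample_id` and the example paths.

```python
import hashlib


def sample_id(sample_name, vcf_file_path=None):
    """Hypothetical sample ID: a stable 63-bit integer hash.

    With vcf_file_path=None only the sample name is hashed; otherwise
    [vcf_file_path, sample_name] is hashed, mirroring the two modes
    described above. The hash function here is NOT the one used by
    Variant Transforms; it only illustrates the idea.
    """
    parts = [sample_name] if vcf_file_path is None else [vcf_file_path, sample_name]
    digest = hashlib.sha256("\x00".join(parts).encode("utf-8")).digest()
    # Keep the top 63 bits so the value fits in a signed 64-bit integer.
    return int.from_bytes(digest[:8], "big") >> 1


# Same name, different (hypothetical) files: IDs differ only when the path is hashed.
same = sample_id("NA12878") == sample_id("NA12878")
distinct = sample_id("NA12878", "gs://bucket/a.vcf") != sample_id("NA12878", "gs://bucket/b.vcf")
```

Hashing the name alone lets the same sample from several files share one ID, while hashing the path as well keeps identically named samples from different files apart.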

New flags

  • vcf_to_bq:
    • --sample_lookup_optimized_output_table: to store a second copy of variants in BigQuery tables that are optimized for sample lookup queries. This feature is particularly useful when the input VCF file contains joint genotyped samples.
    • --keep_intermediate_avro_files: to keep the intermediate Avro files in your temp directory in the GCS bucket.
    • --use_1_based_coordinate: By default the start position is 1-based and the end position is inclusive. Set this flag to False to use 0-based coordinates.
    • --sample_name_encoding: determines the way sample_id is hashed. Default value is WITHOUT_FILE_PATH. If set to WITH_FILE_PATH, then sample_id will be a hash of [vcf_file_path, sample_name].
    • --sharding_config_path: replaces --partition_config_path.
  • bq_to_vcf:
    • --bq_uses_1_based_coordinate: set to False if --use_1_based_coordinate was set to False when the BQ tables were generated, so that start positions are 0-based.
    • --sample_names: replaces --call_names.
    • --preserve_sample_order: replaces --preserve_call_names_order.
  • docker run flags:
    • All the following flags are required:
      • --project
      • --regions
      • --temp_location
    • If you need to run Variant Transforms in a subnetwork using private IP addresses:
      • --subnetwork ${CUSTOM_SUBNETWORK}
      • --use_public_ips false
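
The effect of the two coordinate flags can be sketched as a pair of conversions. This assumes the 0-based form uses an exclusive end position (the usual half-open convention), which the notes above do not state explicitly; the function names are hypothetical.

```python
def to_one_based(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, end-inclusive.

    Assumption: the 0-based representation is half-open, [start, end).
    Only the start shifts; the exclusive 0-based end equals the
    inclusive 1-based end.
    """
    return start + 1, end


def to_zero_based(start, end):
    """Inverse of to_one_based under the same assumption."""
    return start - 1, end


# A single-base variant at VCF position 100 under each convention.
one_based = to_one_based(99, 100)
zero_based = to_zero_based(100, 100)
```

Under this assumption, a table written with one convention and read with the other would shift every start position by one, which is why the two flags need to agree.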

Deprecated flags

The following flags are deprecated and will be removed in the next release:

  • --optimize_for_large_inputs: because sharding is done by default for all inputs.
  • --num_bigquery_write_shards: because we are using Avro sink in the Dataflow pipeline.
  • --output_avro_path: replaced with --keep_intermediate_avro_files.
  • --reference_names: you can achieve the same goal by modifying the default sharding config file.

Underlying improvements

  • Switched our default VCF parser from PyVCF to pysam.
  • Updated to Beam 2.22.
  • Changed the launcher VM to g1-small to reduce the overall cost of running Variant Transforms.

Breaking Changes

  • By default, 1-based coordinates are used for genomic indexing. bq_to_vcf uses the same default, so a VCF -> BigQuery -> VCF round trip with default flags should work.
  • call.name is replaced with call.sample_id.
  • --partition_config_path is replaced with --sharding_config_path.
  • The sharding config YAML format has changed.
  • Output table names cannot contain __ because this string is reserved for separating the table base name from the suffixes read from the sharding config file.
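
The reserved __ separator can be sketched as follows. The helper names are hypothetical and this is not the Variant Transforms implementation; it only shows why a base name containing __ would make the suffix ambiguous.

```python
TABLE_SUFFIX_SEPARATOR = "__"


def sharded_table_name(base_name, suffix):
    """Join a base table name and a suffix such as 'chr1' or 'sample_info'."""
    if TABLE_SUFFIX_SEPARATOR in base_name:
        raise ValueError(
            f"base table name {base_name!r} must not contain {TABLE_SUFFIX_SEPARATOR!r}"
        )
    return base_name + TABLE_SUFFIX_SEPARATOR + suffix


def split_table_name(full_name):
    """Split a full table name at the first separator into (base, suffix)."""
    base, sep, suffix = full_name.partition(TABLE_SUFFIX_SEPARATOR)
    if not sep:
        raise ValueError(f"{full_name!r} has no {TABLE_SUFFIX_SEPARATOR!r} suffix")
    return base, suffix


# 'my_variants' is a hypothetical base table name.
name = sharded_table_name("my_variants", "chr1")
parts = split_table_name("my_variants__sample_info")
```

Because the base name never contains the separator, splitting at the first __ always recovers the suffix unambiguously.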

The following flags have been removed in this release:

  • --vcf_parser
  • --partition_config_path: replaced with --sharding_config_path
  • --call_names: replaced with --sample_names
  • --preserve_call_names_order: replaced with --preserve_sample_order
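
When updating existing scripts, the renames above could be applied mechanically. The sketch below is a hypothetical helper, not part of Variant Transforms; it assumes flags are passed as separate list items, optionally in --flag=value form, and it only warns about --vcf_parser rather than deleting it.

```python
# Renames taken from the list above; the helper itself is hypothetical.
RENAMED_FLAGS = {
    "--partition_config_path": "--sharding_config_path",
    "--call_names": "--sample_names",
    "--preserve_call_names_order": "--preserve_sample_order",
}
REMOVED_FLAGS = {"--vcf_parser"}


def migrate_args(args):
    """Return (new_args, warnings) for a pre-0.9.0 argument list."""
    new_args, warnings = [], []
    for arg in args:
        # Split '--flag=value' into its name and value; bare flags have no '='.
        name, sep, value = arg.partition("=")
        if name in RENAMED_FLAGS:
            new_args.append(RENAMED_FLAGS[name] + sep + value)
        else:
            if name in REMOVED_FLAGS:
                warnings.append(f"{name} was removed in v0.9.0; delete it manually")
            new_args.append(arg)
    return new_args, warnings


migrated, notes = migrate_args(["--call_names=a,b", "--vcf_parser=PYVCF", "--append"])
```

Renaming --partition_config_path is not sufficient on its own, since the sharding config YAML format has also changed and the file it points to needs to be rewritten.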