WARNING: Not actively maintained!
This tutorial combines a particular reference genome with individual annotation resources to create an "all possible SNPs" table.
dsub and BigQuery are used to run all of these steps in the cloud.
This is a work-in-progress. Next steps are to:
- add more annotation resources to the JOIN
- add examples that make use of the all possible SNPs GRCh38 table to analyze SNPs from the Platinum Genomes DeepVariant cohort. These examples would be similar to https://github.com/googlegenomics/bigquery-examples/tree/master/platinumGenomes.
Set a few environment variables to facilitate cutting and pasting the subsequent commands.
# The Google Cloud Platform project id in which to process the annotations.
PROJECT_ID=your-project-id
# The bucket name (with the gs:// prefix) for logs and temp files.
BUCKET=gs://your-bucket-name
# The BigQuery dataset, which must already exist, in which to store annotations.
DATASET=your_bigquery_dataset_name
In this tutorial we're specifically working with Verily’s version of GRCh38.
Note that instead you could:
- Use one of the other reference genomes that are already available in Cloud Storage. See Reference Genomes for the list and Cloud Storage paths.
- Copy the FASTA file for the desired reference genome to Cloud Storage, as sketched below. For more detail, see Copying large files to a bucket.
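A minimal copy might look like the following; the local FASTA file name here is a hypothetical placeholder for whichever reference genome you downloaded.
# Copy a locally downloaded reference FASTA (placeholder file name) to the bucket.
gsutil cp ./my_reference.genome.fa ${BUCKET}/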
Run the following dsub command to convert the FASTA file for the reference genome into a format amenable to BigQuery.
# Copy the script dsub will run to Cloud Storage.
gsutil cp fasta_to_kv.py ${BUCKET}
# Run the conversion operation.
dsub \
--project ${PROJECT_ID} \
--zones "us-central1-*" \
--logging ${BUCKET}/fasta_to_kv.log \
--image python:2.7-slim \
--input FASTA=gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa \
--input CONVERTER=${BUCKET}/fasta_to_kv.py \
--output KV=${BUCKET}/GRCh38_Verily_v1.genome.txt \
--command 'cat "${FASTA}" | python "${CONVERTER}" > "${KV}"' \
--wait
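To spot-check the conversion before loading, you can stream the first few lines of the output; judging from the load schema used below, each record holds '>'-delimited fields for the chromosome, the start position of the sequence chunk, and the sequence itself.
# Optional: peek at the first few converted records.
gsutil cat ${BUCKET}/GRCh38_Verily_v1.genome.txt | head -n 5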
Use the bq command line tool to load the sequences into BigQuery.
bq --project_id ${PROJECT_ID} load \
-F '>' \
--schema unused:string,chr:string,sequence_start:integer,sequence:string \
${DATASET}.VerilyGRCh38_sequences \
${BUCKET}/GRCh38_Verily_v1.genome.txt
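As a quick sanity check on the load, you can count the records per chromosome; this query is purely illustrative and not part of the curation pipeline.
# Count the loaded sequence records per chromosome.
bq --project_id ${PROJECT_ID} query --use_legacy_sql=false \
"SELECT chr, COUNT(1) AS num_records FROM ${DATASET}.VerilyGRCh38_sequences GROUP BY chr ORDER BY chr"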
Run script render_templated_sql.py to create the SQL that will perform the JOIN.
python ./render_templated_sql.py \
--sequence_table ${DATASET}.VerilyGRCh38_sequences \
--b38
Then run the generated SQL via the BigQuery web UI or the bq command line tool and materialize the result to a new table.
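For example, using the bq command line tool the materialization step might look like the sketch below; the generated SQL file name and the destination table name are hypothetical placeholders, and --use_legacy_sql may need adjusting to match the dialect of the generated query.
# Materialize the JOIN result into a new table.
# NOTE: the .sql file name and destination table below are placeholders.
bq --project_id ${PROJECT_ID} query \
--use_legacy_sql=false \
--destination_table ${DATASET}.VerilyGRCh38_all_possible_snps \
"$(cat all_possible_snps.sql)"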
See render_templated_sql.py --help for more details.