suffynx

Support scripts for "Retrovirus insertion site analysis of LGL leukemia patient genomes"

Dependencies

Installation

To install suffynx from source code:

  1. Download the latest version of suffynx from GitHub:
cd some_path
git clone https://github.com/rsharris/suffynx
  2. Compile:
cd some_path/suffynx/chrom_avg
make

cd some_path/suffynx/encodachrom
make
  3. Set up paths:

Copy or symlink the executables (some_path/suffynx/chrom_avg/chrom_avg and some_path/suffynx/encodachrom/encodachrom) into a directory on your shell's $PATH. For example:

cd my_exe_path
ln -s some_path/suffynx/chrom_avg/chrom_avg .
ln -s some_path/suffynx/encodachrom/encodachrom .
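
After linking, it can be worth confirming that the shell actually finds the executables. The following is a small sketch; `on_path` is a hypothetical helper name, not something provided by suffynx.

```shell
# on_path NAME...: report every command the shell cannot find on $PATH.
# Returns 0 only if all of the named commands are found.
on_path() {
    local rc=0 name
    for name in "$@"; do
        if ! command -v "$name" >/dev/null 2>&1; then
            echo "not on PATH: $name" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For the two compiled programs this would be called as `on_path chrom_avg encodachrom`.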

Symlink each Python script into a directory on your shell's $PATH, dropping the ".py" extension from the link name. For example:

cd my_exe_path
ln -s some_path/suffynx/chrom_avg/close_intervals.py close_intervals
ln -s some_path/suffynx/chrom_avg/collect_tags.py collect_tags
ln -s some_path/suffynx/chrom_avg/create_script_call_insertions_discordant.py create_script_call_insertions_discordant
ln -s some_path/suffynx/chrom_avg/create_script_clipped_breakpoints.py create_script_clipped_breakpoints
ln -s some_path/suffynx/chrom_avg/create_script_clipped_breakpoints_high.py create_script_clipped_breakpoints_high
ln -s some_path/suffynx/chrom_avg/create_script_depth.py create_script_depth
ln -s some_path/suffynx/chrom_avg/create_script_discordant_mates_dense.py create_script_discordant_mates_dense
ln -s some_path/suffynx/chrom_avg/create_script_insert_depth.py create_script_insert_depth
ln -s some_path/suffynx/chrom_avg/create_script_insert_depth_dense.py create_script_insert_depth_dense
ln -s some_path/suffynx/chrom_avg/create_script_insert_depth_sparse.py create_script_insert_depth_sparse
ln -s some_path/suffynx/chrom_avg/create_script_insert_length.py create_script_insert_length
ln -s some_path/suffynx/chrom_avg/create_script_insert_length_sparse.py create_script_insert_length_sparse
ln -s some_path/suffynx/chrom_avg/create_script_insert_length_sparse_or_normal_inserts_sparse.py create_script_insert_length_sparse_or_normal_inserts_sparse
ln -s some_path/suffynx/chrom_avg/create_script_map.py create_script_map
ln -s some_path/suffynx/chrom_avg/create_script_short_or_discordant.py create_script_short_or_discordant
ln -s some_path/suffynx/chrom_avg/fill_genomic_interval_gaps.py fill_genomic_interval_gaps
ln -s some_path/suffynx/chrom_avg/filtered_sam_to_intervals.py filtered_sam_to_intervals
ln -s some_path/suffynx/chrom_avg/intervals_to_ucsc_catalog.py intervals_to_ucsc_catalog
ln -s some_path/suffynx/chrom_avg/keep_first.py keep_first
ln -s some_path/suffynx/chrom_avg/make_bigwig_info.py make_bigwig_info
ln -s some_path/suffynx/chrom_avg/make_bwa_jobs.py make_bwa_jobs
ln -s some_path/suffynx/chrom_avg/proximal_feature_intervals.py proximal_feature_intervals
ln -s some_path/suffynx/chrom_avg/sam_reader.py sam_reader
ln -s some_path/suffynx/chrom_avg/today.py today
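
Since every link follows the same pattern, the list above could also be produced with a loop. This is a sketch under that assumption; `link_scripts` is a hypothetical helper, and it links every `.py` file it finds in the source directory, which may be more than the scripts listed above.

```shell
# link_scripts SRC_DIR BIN_DIR: symlink each SRC_DIR/*.py into BIN_DIR,
# naming the link without the ".py" extension.
# SRC_DIR should be an absolute path so the links resolve from anywhere.
link_scripts() {
    local src_dir=$1 bin_dir=$2 script name
    for script in "$src_dir"/*.py; do
        [ -e "$script" ] || continue        # glob matched nothing
        name=$(basename "$script" .py)
        ln -sf "$script" "$bin_dir/$name"   # -f replaces an existing link
    done
}
```

With the placeholder paths used above, the call would be `link_scripts some_path/suffynx/chrom_avg my_exe_path`, with both placeholders replaced by absolute paths.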

Pipeline Tutorial

We use a two-layer job paradigm. For a given sample (i.e. reads from an individual) we first run a series of python programs to create job scripts. Then we run the job scripts.

Generally, each stage of the pipeline corresponds to a track, which assigns a value to each interval of the reference genome. Each stage has a Python script that creates a job script, and that job script computes the track for the stage. Job scripts are bash shell scripts.
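
The two layers can be pictured with a toy example. Here `make_job` is a hypothetical stand-in for one of the create_script_* programs, and the job it writes only prints a message; nothing in it reflects what the real scripts emit.

```shell
# make_job RUN: hypothetical stand-in for a create_script_* program.
# It only writes a bash job script to stdout; no track is computed here.
make_job() {
    local run=$1
    echo '#!/usr/bin/env bash'
    echo "echo \"computing a track for $run\""
}

jobdir=$(mktemp -d)                          # stand-in for the jobs directory
make_job ZEB_MP > "$jobdir/ZEB_MP.demo.sh"   # layer 1: create the job script
chmod +x "$jobdir/ZEB_MP.demo.sh"
"$jobdir/ZEB_MP.demo.sh"                     # layer 2: run the job script
```

Separating the layers this way means a job script can be inspected or edited before it is run, and rerun later without regenerating it.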

The inputs to the process are a set of reads for the sample, a reference genome and a pipeline control file. You can arrange these files any way you like, but in the following example we assume a single working directory with the following subdirectories:

  • reads: This contains two fastq files for each sequencing run; usually this will be one run for mate pair and another run for paired end.

  • genomes: This contains the reference fasta file, the bwa index, a chromosome lengths file, and any blacklist interval files.

  • data: This contains the pipeline control file.

  • jobs: The job scripts will be created here.

  • alignments: Read-vs-reference alignments will be written here.

  • tracks: Track files will be created here.

  • temp: Temporary files will be created here.
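
The layout above can be created in one step. This is a minimal sketch; `WORKDIR` is a hypothetical variable, and it defaults to a fresh temporary directory so the snippet is safe to try out.

```shell
# Create the seven subdirectories of the example working directory.
# WORKDIR is an assumed name; set it beforehand to use a real location.
WORKDIR="${WORKDIR:-$(mktemp -d)}"
for sub in reads genomes data jobs alignments tracks temp; do
    mkdir -p "$WORKDIR/$sub"
done
ls "$WORKDIR"
```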

In this example, the genome is named "reference", and the sample is named "ZEB". We start with paired end reads files reads/ZEB_PE.1.fastq and reads/ZEB_PE.2.fastq, and mate pair reads files reads/ZEB_MP.1.fastq and reads/ZEB_MP.2.fastq. data/control.dat has been copied from the repository and modified if necessary. There are two blacklist files, genomes/reference.Ns.dat and genomes/repeat_masker.reference.dat.
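
Before generating any job scripts, the input files named above can be checked in one pass. The following is a sketch; `check_inputs` is a hypothetical helper, and it only covers the files listed in this paragraph, not the reference fasta, bwa index or chromosome lengths file.

```shell
# check_inputs BASE: list the example's input files that are missing under BASE.
# Returns 0 only if every file is present.
check_inputs() {
    local base=$1 missing=0 f
    for f in reads/ZEB_PE.1.fastq reads/ZEB_PE.2.fastq \
             reads/ZEB_MP.1.fastq reads/ZEB_MP.2.fastq \
             data/control.dat \
             genomes/reference.Ns.dat genomes/repeat_masker.reference.dat; do
        if [ ! -f "$base/$f" ]; then
            echo "missing: $f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

From the working directory this would be called as `check_inputs "$(pwd)"`.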

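Create the scripts that will map the reads to the reference. The example below covers the mate pair run (ZEB_MP); a later step reads alignments/ZEB_PE.ql_filtered.name_sorted.bam, so the paired end run (ZEB_PE) presumably needs to be mapped the same way.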

create_script_map \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --ref="  {base}/genomes/reference.fa" \
      --reads="{base}/reads/{run}.{mate}.fastq" \
      --bam="  {base}/alignments/{run}" \
      --namesorted \
      --qualityfiltered \
  > jobs/ZEB_MP.map.sh
chmod +x jobs/ZEB_MP.map.sh
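
The `{base}`, `{run}` and `{mate}` fields in the options above look like placeholders that the script generator fills in from `--base`, the run name and the mate number. The following only illustrates that kind of substitution and is not suffynx's actual code; `expand_template` is a hypothetical helper, and the path `/work` is made up.

```shell
# expand_template TEMPLATE BASE RUN MATE: fill in {base}, {run} and {mate}.
# Illustration only; values containing "|" or "&" would confuse sed here.
expand_template() {
    printf '%s\n' "$1" |
        sed -e "s|{base}|$2|g" -e "s|{run}|$3|g" -e "s|{mate}|$4|g"
}

expand_template "{base}/reads/{run}.{mate}.fastq" "/work" "ZEB_MP" "1"
# prints /work/reads/ZEB_MP.1.fastq
```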

Create the scripts that will compute the average mate pair insert length signal and indicator tracks.

This is called "Track 1" in supplementary methods step 2.

create_script_insert_length \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --bam={base}/alignments/{run}.ql_filtered.name_sorted.bam \
      --namesorted \
      --chroms={base}/genomes/reference.chrom_lengths \
      --track={base}/tracks/{run}.insert_length \
      --gzip \
  > jobs/ZEB_MP.insert_length.sh
chmod +x jobs/ZEB_MP.insert_length.sh
create_script_insert_length_sparse \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --chroms={base}/genomes/reference.chrom_lengths \
      --blacklist={base}/genomes/reference.Ns.dat \
      --blacklist={base}/genomes/repeat_masker.reference.dat \
      --input={base}/tracks/{run}.insert_length.gz \
      --track={base}/tracks/{run}.insert_length.sparse \
  > jobs/ZEB_MP.insert_length_sparse.sh
chmod +x jobs/ZEB_MP.insert_length_sparse.sh

Create the scripts that will compute the mate pair short and normal insert coverage depth signal and indicator tracks.

These are called "Tracks 2 and 3" in supplementary methods step 3.

create_script_insert_depth \
      --class=short \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --bam={base}/alignments/{run}.ql_filtered.name_sorted.bam \
      --namesorted \
      --chroms={base}/genomes/reference.chrom_lengths \
      --track={base}/tracks/{run}.{kind}_inserts.depth \
  > jobs/ZEB_MP.insert_depth.short.sh
chmod +x jobs/ZEB_MP.insert_depth.short.sh
create_script_insert_depth_dense \
      --class=short \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --chroms={base}/genomes/reference.chrom_lengths \
      --input={base}/tracks/{run}.{kind}_inserts.depth \
      --track={base}/tracks/{run}.{kind}_inserts.depth.dense \
  > jobs/ZEB_MP.insert_depth_dense.sh
chmod +x jobs/ZEB_MP.insert_depth_dense.sh
create_script_insert_depth \
      --class=normal \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --bam={base}/alignments/{run}.ql_filtered.name_sorted.bam \
      --namesorted \
      --chroms={base}/genomes/reference.chrom_lengths \
      --track={base}/tracks/{run}.{kind}_inserts.depth \
  > jobs/ZEB_MP.insert_depth.normal.sh
chmod +x jobs/ZEB_MP.insert_depth.normal.sh
create_script_insert_depth_sparse \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --chroms={base}/genomes/reference.chrom_lengths \
      --blacklist={base}/genomes/reference.Ns.dat \
      --blacklist={base}/genomes/repeat_masker.reference.dat \
      --input={base}/tracks/{run}.{kind}_inserts.depth \
      --track={base}/tracks/{run}.{kind}_inserts.depth.sparse \
  > jobs/ZEB_MP.insert_depth_sparse.sh
chmod +x jobs/ZEB_MP.insert_depth_sparse.sh

Create the scripts that will compute the mate pair discordant mates coverage depth signal and indicator tracks.

This is called "Track 4" in supplementary methods step 4.

See https://github.com/rsharris/suffynx/tree/master/discordant_mates for creation of the signal track.

This creates the script that converts the signal track to an indicator track.

create_script_discordant_mates_dense \
      --class=short \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --chroms={base}/genomes/reference.chrom_lengths \
      --input={base}/tracks/{run}.BDB.MMQ40.MCP40.rmdup.bedgraph \
      --track={base}/tracks/{run}.discordant_mates.dense \
  > jobs/ZEB_MP.discordant_mates_dense.sh
chmod +x jobs/ZEB_MP.discordant_mates_dense.sh

Create the scripts that will compute the paired end clipped breakpoints signal and indicator tracks.

This is called "Track 5" in supplementary methods step 5.

create_script_clipped_breakpoints \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_PE \
      --bam={base}/alignments/{run}.ql_filtered.name_sorted.bam \
      --namesorted \
      --chroms={base}/genomes/reference.chrom_lengths \
      --track={base}/tracks/{run}.clipped_breakpoints \
  > jobs/ZEB_PE.clipped_breakpoints.sh
chmod +x jobs/ZEB_PE.clipped_breakpoints.sh
create_script_clipped_breakpoints_high \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_PE \
      --chroms={base}/genomes/reference.chrom_lengths \
      --input={base}/tracks/{run}.clipped_breakpoints \
      --track={base}/tracks/{run}.clipped_breakpoints.high \
  > jobs/ZEB_PE.clipped_breakpoints_high.sh
chmod +x jobs/ZEB_PE.clipped_breakpoints_high.sh

Create the scripts that will combine tracks 1 through 5 and call insertions.

As per supplementary methods step 6.

create_script_insert_length_sparse_or_normal_inserts_sparse \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --chroms={base}/genomes/reference.chrom_lengths \
      --input1={base}/tracks/{run}.insert_length.sparse \
      --input2={base}/tracks/{run}.normal_inserts.depth.sparse \
      --track={base}/tracks/{run}.insert_length_sparse_or_normal_inserts_sparse \
  > jobs/ZEB_MP.track1_or_track2.sh
chmod +x jobs/ZEB_MP.track1_or_track2.sh
create_script_short_or_discordant \
      --init=shebang:bash \
      --base="`pwd`" \
      ZEB_MP \
      --chroms={base}/genomes/reference.chrom_lengths \
      --input1={base}/tracks/{run}.short_inserts.depth.dense \
      --input2={base}/tracks/{run}.discordant_mates.dense \
      --track={base}/tracks/{run}.short_or_discordant \
  > jobs/ZEB_MP.track3_or_track4.sh
chmod +x jobs/ZEB_MP.track3_or_track4.sh
create_script_call_insertions_discordant \
      --control=data/control.dat \
      --init=shebang:bash \
      --base="`pwd`" \
      peRun=ZEB_PE \
      mpRun=ZEB_MP \
      --chroms={base}/genomes/reference.chrom_lengths \
      --input="{base}/tracks/{mprun}.insert_length_sparse_or_normal_inserts_sparse" \
      --input="{base}/tracks/{mprun}.short_or_discordant" \
      --track="{base}/tracks/{run}.called_insertions" \
  > jobs/ZEB.called_insertions.sh
chmod +x jobs/ZEB.called_insertions.sh

Compute.

Once all the job scripts have been created, run them in the order shown, since later stages read files written by earlier ones:

./jobs/ZEB_MP.map.sh
./jobs/ZEB_MP.insert_length.sh
./jobs/ZEB_MP.insert_length_sparse.sh
./jobs/ZEB_MP.insert_depth.normal.sh
./jobs/ZEB_MP.insert_depth_sparse.sh
./jobs/ZEB_MP.insert_depth.short.sh
./jobs/ZEB_MP.insert_depth_dense.sh
./jobs/ZEB_MP.discordant_mates_dense.sh
./jobs/ZEB_PE.clipped_breakpoints.sh
./jobs/ZEB_PE.clipped_breakpoints_high.sh
./jobs/ZEB_MP.track1_or_track2.sh
./jobs/ZEB_MP.track3_or_track4.sh
./jobs/ZEB.called_insertions.sh
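
Because each job depends on the output of earlier ones, it may help to stop at the first failure instead of letting later jobs run on missing inputs. This is a sketch of one way to do that; `run_jobs` is a hypothetical helper, and it assumes each job script exits with a nonzero status when it fails.

```shell
# run_jobs SCRIPT...: run the job scripts in the given order and stop at the
# first one that exits with a nonzero status.
run_jobs() {
    local job
    for job in "$@"; do
        echo "running $job"
        if ! "$job"; then
            echo "FAILED: $job" >&2
            return 1
        fi
    done
}
```

It would be called with the scripts in the same order as the list above, for example `run_jobs ./jobs/ZEB_MP.map.sh ./jobs/ZEB_MP.insert_length.sh` and so on through `./jobs/ZEB.called_insertions.sh`.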

Feb/2019, Bob Harris (rsharris at bx dot psu dot edu)
