Skip to content

A Bayesian Phylo-HMM for B cell receptor sequence analysis

Notifications You must be signed in to change notification settings

naveenjasti/linearham

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

linearham

     __      _.._
  .-'__`-._.'.--.'.__.,
 /--'  '-._.'    '-._./
/__.--._.--._.'``-.__/
'._.-'-._.-._.-''-..'

Docker Repository on Quay  

References

  1. Dhar A, Ralph DK, Minin VN, Matsen FA IV (2020) A Bayesian phylogenetic hidden Markov model for B cell receptor sequence analysis. PLoS Comput Biol 16(8): e1008030 (https://doi.org/10.1371/journal.pcbi.1008030).

Installation

Linearham's dependencies are described in the Dockerfile. We recommend that you run linearham inside a Docker container, since this will make installation much easier (if you're new to Docker, read this). However, you can also install the dependencies by hand, in which case you should clone the repository and run each command in the Dockerfile that's on a line starting with RUN (treat WORKDIR as cd). The more similar your system is to that described by the Dockerfile's FROM line (at the moment, debian), the easier this will be.

Using Docker

It's best to start by running an interactive session in the container:

docker run -it quay.io/matsengrp/linearham /bin/bash.

You can also create a shell script with your linearham commands for docker to run, and put it in place of /bin/bash (use test.sh as an example). If you want your run to be reproducible, choose a tag of the form v<stuff> from quay.io, then specify it like so: quay.io/matsengrp/linearham:v<stuff>.

To access your own data from within the container, or to persist your output beyond the container, you must use volumes by specifying -v. We recommend using this convention (other paths may not work):

  1. Choose a directory outside of Docker that contains both your input data directory and your desired output directory
  2. Mount it as a volume to /linearham/work inside the container: -v /your/local/path:/linearham/work
  3. All commands inside the container referencing paths inside that directory should do so via /linearham/work, e.g. setting --outdir=/linearham/work/output inside the container will persist your output outside the container in /your/local/path/output.

Note that because Docker must run as root, this means that you will be writing to the directory on your host machine as root, so a) be very careful and b) don't choose anything anywhere near / (something like /home/user/several/sub/dirs is good).

Running linearham

All linearham actions are run using scons in the main linearham directory. Available actions are --run-partis, --run-linearham, and --build-partis-linearham. Note that because of the way scons parses arguments, you must always use an = sign: --arg=val.

Linearham uses a partis output file as its input. If you've already run partis to create this file, skip to --run-linearham; if not, you can have linearham run partis for you with --run-partis.

--run-partis

This runs partis on an input sequence file:

scons --run-partis --fasta-path=<file> --locus={igh|igk|igl} --outdir=<dir> [--parameter-dir=<dir>]

--fasta-path can be any file type that partis --infname handles (see partis help). Partis requires a directory with fitted sample-specific parameters, which if not already present will be automatically inferred based on the sequences in the fasta file. Because linearham runs on only a single family at a time, but these parameters are more accurate if inferred on the entire repertoire of many clonal families, it is better if you use parameters cached in a previous partis run on the entire repertoire, and then pass them to linearham with --parameter-dir. However, if you don't, the automatically-inferred parameters will still work fine, they'll just be somewhat less accurate (since they'll only be based on the one family).

Other partis-related arguments:

Name Description
--all-clonal-seqs If set, attempts to force all sequences in the fasta file into the same clonal family; otherwise it runs partis partition to infer the clonal families
--locus Which immunoglobulin locus (defaults to igh)?
--outdir The output directory (defaults to output).

--run-linearham

Once you have a partis output file, whether you made it separately or with linearham, you can run linearham itself:

scons --run-linearham --outdir=<dir> [--partis-yaml-file=<file>] [--parameter-dir=<dir>]

If there is one clonal family (i.e. cluster) in the partis output file, linearham will run on that. If there is more than one, you'll have to select a cluster to run on using the following options. Partis performs clustering hierarchically, so its output stores a list of partitions, where each partition divides the sequences in the repertoire into clonal families (clusters). By default, linearham looks in the best (most likely) of these partitions, but you can specify the (zero-based) index of a different one with --partition-index. Within a partition, you can specify a cluster either by (zero-based) index with --cluster-index, or with the unique id of a particular sequence in the cluster with --cluster-seed-unique-id (see partis --seed-unique-id for more info). Other options:

Command Description
--partis-yaml-file Path to the partis output file that is linearham's input. Defaults to the location in --outdir to which the linearham --run-partis action writes it
--outdir The output directory (defaults to output).
--lineage-unique-ids ,-separated list of sequence ids, each of whose lineages will be analyzed separately, and in somewhat more detail than by default (see below).
--parameter-dir Directory from which linearham reads partis hmm files. If not set, it defaults to the location in --outdir used by --run-partis. As for --run-partis (above), parameters will be much more accurate if you cache them with partis beforehand on the entire repertoire, but if this isn't possible they'll be inferred automatically on the one family on which you're running linearham, which should be fine.

If you don't have a clonal family/cluster of interest, or are not sure how to identify it using these options, you can run scripts/parse_cluster.py to work it out.

For example, running:

./scripts/parse_cluster.py lib/partis/test/reference-results/partition-new-simu.yaml --fasta-output-file parsed_cluster.fa --yaml-output-file parsed_cluster.yaml | less -RS

will print a table of available clusters in the best partition similar to this:

 available clusters in partition at index 29 (best):
index   size    unique_ids
0       71      [...]
1       11      [...]
2       262     [...]
3       4       [...]

Using the indices from this table, you can specify the corresponding clusters to Linearham. Running on the cluster with 262 sequences from the above table would look like:

scons --run-linearham --cluster-index=2 <args.. >

You can also figure out which sequences are in which clusters with the partis view-output action piped to less -RS.

Other linearham-related arguments:

Command list? Description
--template-path no The RevBayes template path (defaults to templates/revbayes_template.rev).
--mcmc-iter yes How many RevBayes MCMC iterations should we use (defaults to 10000)?
--mcmc-thin yes What RevBayes MCMC thinning frequency should we use (defaults to 10)?
--tune-iter yes How many RevBayes tuning iterations should we use (defaults to 5000)?
--tune-thin yes What RevBayes tuning thinning frequency should we use (defaults to 100)?
--num-rates yes The number of gamma rate categories (defaults to 4).
--burnin-frac yes What fraction of MCMC burnin should we use (defaults to 0.1)?
--subsamp-frac yes What bootstrap sampling fraction should we use (defaults to 0.05)?
--rng-seed yes The random number generator (RNG) seed (defaults to 0).
--asr-pfilters no The ancestral sequence posterior probability thresholds (defaults to 0.1).

If some aguments are specified as ,-separated lists (middle column), linearham will run revbayes separately, writing to separate output directories, for all combinations of all such parameters. For more information on these arguments, run scons --help.

--build-partis-linearham

This compiles linearham, partis, and other dependencies. You'll only need to run this if you've either modified some source code or you're installing without docker.

Output files

A variety of output files are written to --outdir. In the top-level dir are:

  • the partis output file that was used as input for linearham (e.g. partis_run.yaml), which contains partis-inferred clonal families (clusters) and annotations (including inferred naive sequence)
  • sub dirs for each cluster on which linearham was run, with name of form cluster-N/ for the cluster index N

Within each cluster's subdir cluster-N/ are:

  • a fasta file cluster-N/cluster_seqs.fasta with each of the cluster's input sequences, as well as its partis-inferred naive sequence
    • NOTE the input sequences have SHM indels "reversed" (reverted to their state in the naive rearrangement), and non-variable regions (V/J framework) trimmed off
  • a partis yaml output file cluster.yaml resulting from pulling just this cluster out of the original partis output file that was used as input
  • the linearham output dir cluster-N/mcmciter<stuff> where <stuff> records the exact options of the run e.g. mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/

XXX why are there two levels of subdir, i.e. mcmcmiterXXX as well as burninfracYYY?

Within the linearham output dir mcmciter<stuff>

e.g. mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/:

file format description
lh_revbayes_run.trees XXX XXX
revbayes_run.log XXX XXX
revbayes_run.rev XXX XXX
revbayes_run.stdout.log XXX XXX
revbayes_run.trees XXX XXX

Within burninfrac<stuff>

e.g. burninfrac0.1_subsampfrac0.05/

file format description
linearham_run.log tsv posterior samples of the naive sequence annotation and the phylogenetic substitution model and rate variation parameters
linearham_run.trees newick posterior tree samples with ancestral sequence annotations (formatted for use by Dendropy
linearham_run.ess tsv approximate effective sample sizes for each field in linearham_run.log
aa_naive_seqs.fasta fasta each sampled naive amino acid sequence and its associated posterior probability
aa_naive_seqs.dnamap fasta (ish) map from each sampled naive amino acid sequence to its corresponding set of nucleotide naive sequences and posterior probabilities
aa_naive_seqs.png png logo plot of naive amino acid sequence posterior probability using WebLogo to visualize per-site uncertainties
linearham_annotations_all.yaml yaml XXX
linearham_annotations_best.yaml yaml XXX

If --lineage-unique-ids is specified, there will also be additional lineage-specific output files in subdirectories (one for each sequence id specified) like lineage_<uid>/. These include:

file format description
aa_lineage_seqs.fasta fasta for each intermediate ancestor in the lineage of the sequence with the specified id, the sampled amino acid sequence and its associated posterior probability
aa_lineage_seqs.dnamap fasta(ish) for each intermediate ancestor of the lineage of the sequence with the specified id, map from sampled amino acid sequence to its corresponding set of nucleotide sequences and posterior probabilities
aa_lineage_seqs.pfilterX.png png posterior probability lineage graphic made with Graphviz, where X is the posterior probability cutoff for the sampled sequences.

Naive sequence comparisons

One way to visualize the various output naive sequences and their probabilities is with lib/partis/bin/cf-linearham.py, which takes as input a linearham output dir and a partis output file (the latter preferably created with the --calculate-alternative-annotations option set). It then prints an ascii-art comparison of the amino acid and nucleotide naive sequences, as well as (for partis) a rundown of the alternative gene calls and their probabilities (the most likely of which was presumably input to linearham). More info here.

About

A Bayesian Phylo-HMM for B cell receptor sequence analysis

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 92.4%
  • Python 6.7%
  • Other 0.9%