Probabilistic alignment software to deal with indels in Ig receptor sequences.
It was introduced in the paper: Cosimo Lupo, Natanael Spisak, Aleksandra M. Walczak, Thierry Mora, "Learning the statistics and landscape of somatic mutation-induced insertions and deletions in antibodies", arXiv:2112.07953 (2021).
Latest released version: 1.2.4 (dated 10/03/2022).
Free use of indie is granted under the terms of the GNU General Public License version 3 (GPLv3).
For any issue, question or bug, please write us an email.
Before compiling, make sure your platform is equipped with a C/C++ compiler supporting OpenMP. The GNU compiler (i.e. gcc/g++) is recommended, and is usually already present on Linux systems. If you are on macOS, please read the section below carefully before compiling. indie has not yet been tested on Windows.
No libraries need to be installed separately; all the necessary files are already shipped with the indie distribution.
macOS platforms are now equipped with the clang compiler (installed together with the Xcode application), aliased as gcc/g++. This means that when using the Makefile on macOS, the g++ invocation actually refers to clang. Unfortunately, Apple clang does not support OpenMP out of the box, so the compilation of indie will fail.
The recommended procedure on macOS is hence to install the GNU compiler through MacPorts or Homebrew (which has to be installed first) and then use it to compile indie.
It is good practice not to give the freshly installed GNU compiler the same alias gcc as Apple clang; otherwise, the compilation of other packages may fail.
Download the latest version of the indie source code from GitHub and extract the files from the archive, if the OS has not already done so automatically.
Copy the directory containing the source files to the desired location and then compile:
make
By default, the Makefile invokes the g++ compiler; if that command or alias actually refers to a compiler that does not support OpenMP, the compiler has to be passed explicitly to the make command through the CXX variable. For example, to use a compiler named g++-9:
make CXX=g++-9
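The choice of compiler can also be scripted. The following is a minimal Python sketch, not part of indie: the candidate names are hypothetical examples of versioned GNU compilers, and finding a name on the PATH does not by itself prove that the compiler supports OpenMP.

```python
import shutil

# Hypothetical candidate names; adapt them to what is installed on your system.
CANDIDATES = ["g++-13", "g++-12", "g++-11", "g++-10", "g++-9"]


def pick_gnu_cxx(candidates, which=shutil.which):
    """Return the first candidate compiler name found on the PATH, or None."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


def make_command(cxx=None):
    """Build the make invocation, overriding CXX only when a compiler is given."""
    return ["make"] if cxx is None else ["make", f"CXX={cxx}"]


# Print the command that would be run; nothing is compiled here.
print(" ".join(make_command(pick_gnu_cxx(CANDIDATES))))
```

The resulting list can be passed to subprocess.run from the directory containing the Makefile.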
The indie executable then appears in the current directory.
No full installation is currently provided, so the indie executable and the supplied models are not copied to the system default directories. Every invocation of indie should be made from the folder that contains the executable, together with the models directory.
indie has several command-line options, which can be listed with the -h or --help option:
./indie -h
Here is a summary:
List of command-line options
-a, --alpha [req. arg.] sets the global learning rate parameter to the specified value
-b, --batch_name [req. arg.] creates a directory with the specified name
and puts all the output files in it
-g, --generate [req. arg.] generates N synthetic sequences (where N has to be given
as argument)
-h, --help [no arg.] prints all the possible command-line options
and if they require any (optional) argument
-i, --inference [no arg.] performs the inference on model parameters
by maximizing the total alignment likelihood
-l, --log [no arg.] prints on auxiliary files optional info about the convergence
of the algorithm during the inference step
-p, --parameters [req. arg.] reads from the specified file the parameters to be used
for the generation step or as the initial condition for the inference step
-r, --read [req. arg.] reads from the specified file the dataset
of sequences to analyze
-T, --max_iter [req. arg.] sets the maximum number of gradient descent iterations
All the options need to be specified at the same time, in a single invocation of ./indie.
A typical command-line string to read from a file, perform a deterministic annotation and then infer the model parameters would be something like:
./indie -b wk_dir -r input_file.csv -i
where the flag -b specifies the working directory in which the output files are stored. If this directory does not exist yet, it is created, together with any necessary intermediate folders along the path.
Instead of reading an existing file, a custom synthetic dataset of size N can be generated, and the inference can then be launched on it as well by combining the -g flag with the -i one:
./indie -b wk_dir -g N -i
Further details about the various options can be found below.
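Since all options must be given in a single invocation, it can be convenient to assemble the command line programmatically. The sketch below is a hypothetical Python helper (build_indie_command is not part of indie) that only uses the flags listed in the summary above.

```python
def build_indie_command(batch_name=None, read=None, generate=None, parameters=None,
                        alpha=None, max_iter=None, log=False, inference=False,
                        executable="./indie"):
    """Assemble a single indie invocation as an argument list."""
    cmd = [executable]
    # Options that require an argument, in a fixed order.
    for flag, value in (("-b", batch_name), ("-r", read), ("-g", generate),
                        ("-p", parameters), ("-a", alpha), ("-T", max_iter)):
        if value is not None:
            cmd += [flag, str(value)]
    # Options that take no argument.
    if log:
        cmd.append("-l")
    if inference:
        cmd.append("-i")
    return cmd


# Reproduces the two example commands shown above (with N = 1000).
print(" ".join(build_indie_command(batch_name="wk_dir", read="input_file.csv", inference=True)))
print(" ".join(build_indie_command(batch_name="wk_dir", generate=1000, inference=True)))
```

The list can be handed to subprocess.run(cmd, cwd=..., check=True), with cwd set to the folder that contains the indie executable and the models directory.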
indie can be executed on an existing dataset, with the input file specified by means of the -r flag:
./indie -r input_file.csv
This file has to be formatted as a CSV-like text file, with semicolon as field delimiter and the following columns:
seq_ID | aligned_seq_nt | V_best | V_best_start | V_best_end |
---|---|---|---|---|
0 | CAGGTGCAGCTGGTGGAGTCTGGGGGGGGC... | IGHV3-30*12 | 0 | 296 |
1 | GAGGTGCAGCTGGTGGAGTCTGGGGGAGGC... | IGHV3-48*04 | 0 | 296 |
2 | TAGGTGAAGCTCGCCGAGGTGAAGAAGCGT... | IGHV1-2*01 | 0 | 296 |
3 | CAGGTGCAGTTGCAGAGTTCGCTCCCAGGT... | IGHV4-61*03 | 0 | 299 |
... | ... | ... | ... | ... |
The last three columns give the name of the best-scoring V template and the limits of the aligned region along the template itself. Any Ig alignment tool can be used to obtain this information; we recommend igBlast for its reliability and speed, but other tools work as well.
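If the annotations come from another tool, the input file can be written with a few lines of Python. This is a minimal sketch under two assumptions not stated above: that the file starts with a header row carrying the column names of the table, and that the columns appear in that order. The sequence in the example is a truncated placeholder.

```python
import csv
import io

# Column names taken from the table above; their presence as a header row is an assumption.
FIELDS = ["seq_ID", "aligned_seq_nt", "V_best", "V_best_start", "V_best_end"]


def write_indie_input(rows, handle):
    """Write annotated sequences as semicolon-delimited text."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


rows = [
    # Placeholder record: the nucleotide sequence is truncated for illustration only.
    {"seq_ID": 0, "aligned_seq_nt": "CAGGTGCAGCTGGTGGAGTCTGGGGGGGGC",
     "V_best": "IGHV3-30*12", "V_best_start": 0, "V_best_end": 296},
]

buffer = io.StringIO()
write_indie_input(rows, buffer)
print(buffer.getvalue(), end="")
```

To write to disk, replace the in-memory buffer with open("input_file.csv", "w", newline="").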
The preprocessing folder of the repository contains several auxiliary scripts, among which an annotation pipeline based on igBlast that formats sequences (segregated and quality-filtered according to customizable parameters) in the format shown above. It is even possible to start from high-throughput raw reads, using the preprocessing script in the same folder. Please refer to the README file in that folder for further details.
Once the input file has been read, indie creates a folder named read and stores a copy of the input file in it.
Instead of reading sequences from an existing dataset, the generate feature of indie allows creating a synthetic dataset of human Ig sequences, according to a fully customizable model. The default generative model is stored in the models folder, and can be invoked to create a repertoire of size N through the command:
./indie -g N
A specific set of model parameters can be supplied by means of the -p flag:
./indie -g N -p custom_parms.txt
The parameters file has to follow this syntax:
@Rates
#frac_mutated
0.90
#aver_err_rate;del_ratio;ins_ratio
0.10;0.02;0.02
@Profiles
#max_gap_length
50
#del_profile
0.095808132;0.086690783;0.078441064;0.07097641;0.064222112;0.05811057;0.052580618;0.04757691;0.043049369;0.03895268;0.035245842;0.031891757;0.028856855;0.026110762;0.023625994;0.021377684;0.019343328;0.017502567;0.015836978;0.01432989;0.012966221;0.011732322;0.010615844;0.0096056125;0.0086915177;0.0078644104;0.0071160128;0.0064388346;0.0058260985;0.0052716719;0.004770006;0.0043160799;0.0039053506;0.0035337074;0.0031974307;0.0028931549;0.0026178348;0.0023687149;0.0021433019;0.0019393397;0.0017547871;0.0015877971;0.0014366982;0.0012999783;0.001176269;0.0010643322;0.00096304761;0.00087140151;0.00078847669;0.00071344321
#ins_profile
0.095808132;0.086690783;0.078441064;0.07097641;0.064222112;0.05811057;0.052580618;0.04757691;0.043049369;0.03895268;0.035245842;0.031891757;0.028856855;0.026110762;0.023625994;0.021377684;0.019343328;0.017502567;0.015836978;0.01432989;0.012966221;0.011732322;0.010615844;0.0096056125;0.0086915177;0.0078644104;0.0071160128;0.0064388346;0.0058260985;0.0052716719;0.004770006;0.0043160799;0.0039053506;0.0035337074;0.0031974307;0.0028931549;0.0026178348;0.0023687149;0.0021433019;0.0019393397;0.0017547871;0.0015877971;0.0014366982;0.0012999783;0.001176269;0.0010643322;0.00096304761;0.00087140151;0.00078847669;0.00071344321
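The two profiles in the example above appear to follow a geometric decay, with probabilities proportional to exp(-length/10), truncated at 50 and normalized. This is an observation about the listed numbers, not a documented property of indie. Under that assumption, a profile of any length can be produced with the following Python sketch (function names are hypothetical):

```python
import math


def truncated_geometric_profile(max_gap_length, decay_length):
    """Return normalized probabilities for gap lengths 1..max_gap_length."""
    weights = [math.exp(-length / decay_length) for length in range(1, max_gap_length + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def format_profile(profile):
    """Join the probabilities with semicolons, as in the parameters file."""
    return ";".join(f"{p:.8g}" for p in profile)


profile = truncated_geometric_profile(max_gap_length=50, decay_length=10.0)
print(format_profile(profile))
```

A shorter decay_length concentrates the probability on short indels, while a longer one gives a flatter profile.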
The @Rates section refers to the frequency of hypermutation events:
frac_mutated: the probability for a generated Ig sequence to be naive, i.e. not mutated at all;
aver_err_rate: the average probability per base pair of having a point substitution (in fact, this uniform rate is only used in the deterministic annotation of indels, not to generate synthetic sequences, see below);
del_ratio: the base pair probability of hosting a deletion at a given position, rescaled by the point substitution rate of that sequence;
ins_ratio: the base pair probability of hosting an insertion at a given position, rescaled by the point substitution rate of that sequence.
The @Profiles section instead encodes the length distribution of indels:
max_gap_length: the maximum length allowed for a single deletion/insertion event during the indie execution; it has to be smaller than (or equal to) the maximum value of 60 base pairs hard-coded in the software;
del_profile: the normalized vector of probabilities ruling the length of deletions; the first component is the probability of deleting one base pair, the second component is the probability of deleting two base pairs, and so on; the length of this vector has to be max_gap_length;
ins_profile: same as del_profile, but for insertions.
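The constraints listed above (profile length equal to max_gap_length, maximum of 60 base pairs, normalized profiles) can be checked before launching indie. The following Python sketch is a hypothetical helper, not part of the software; it assumes that each line starting with # names the fields of the value line that follows it, and the normalization tolerance is an arbitrary choice.

```python
def parse_params(text):
    """Parse a parameters file into a dict of floats and lists of floats."""
    params, keys = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("@"):
            continue  # blank lines and section markers carry no values
        if line.startswith("#"):
            keys = line[1:].split(";")  # names of the fields on the next line
            continue
        values = [float(v) for v in line.split(";")]
        if len(keys) > 1:
            params.update(zip(keys, values))
        elif keys[0].endswith("_profile"):
            params[keys[0]] = values
        else:
            params[keys[0]] = values[0]
    return params


def validate_params(params, hard_limit=60, tol=1e-6):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    max_gap = int(params["max_gap_length"])
    if max_gap > hard_limit:
        problems.append(f"max_gap_length {max_gap} exceeds {hard_limit}")
    for name in ("del_profile", "ins_profile"):
        profile = params[name]
        if len(profile) != max_gap:
            problems.append(f"{name} has {len(profile)} entries, expected {max_gap}")
        if abs(sum(profile) - 1.0) > tol:
            problems.append(f"{name} is not normalized")
    return problems


# Toy file with short profiles, for illustration only.
example = """@Rates
#frac_mutated
0.90
#aver_err_rate;del_ratio;ins_ratio
0.10;0.02;0.02
@Profiles
#max_gap_length
4
#del_profile
0.4;0.3;0.2;0.1
#ins_profile
0.4;0.3;0.2;0.1
"""
print(validate_params(parse_params(example)))
```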
The generation also relies on the set of V templates (default is models/imgt_templates_IGHV_Homo_Sapiens_F.csv), the corresponding frequency for each V family (default is models/V_usage.csv), and the repertoire distribution of point mutation rates (default is models/prior.csv). Please be careful when modifying these files, or the source code of the generation step itself (in the latter case, remember to recompile after making changes).
Generated sequences are stored in a standard fasta file (synthetic_seqs.fasta) under the generate folder, together with a formatted file (synthetic_seqs_anchored.csv) with the same syntax required for input files, a copy of the model parameters file (generative_params.txt), a copy of the file containing the point mutation distribution (generative_mu_err_distr.txt), and finally a file with the set of events behind the generation of each synthetic sequence, named synthetic_scenarios.txt:
seq_ID | mu_err | V_choice | error_positions | error_list | deletion_pos | deletion_list | insertion_pos | insertion_list |
---|---|---|---|---|---|---|---|---|
0 | 0.054128 | 112 | [34,58,70,74,...] | [T->C,T->C,C->T,T->A,...] | [] | [] | [289] | [6->TGGGCG] |
1 | 0.000359 | 138 | [195] | [G->C] | [] | [] | [] | [] |
2 | 0.133590 | 4 | [0,6,23,26,...] | [C->T,C->A,G->C,T->C,...] | [11] | [12] | [] | [] |
3 | 0.139510 | 244 | [9,15,16,17,...] | [C->T,G->A,A->G,G->T,...] | [211,261] | [19,1] | [] | [] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
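The bracketed fields of the scenarios file can be turned into Python objects with a small parser. The sketch below is hypothetical and rests on a reading of the table above that the source does not spell out: each entry of deletion_list is taken to be the deleted length, and each entry of insertion_list to be the inserted length followed by the inserted bases.

```python
def parse_bracket_list(field):
    """Split a field such as "[211,261]" into its items; "[]" gives an empty list."""
    text = field.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a bracketed list: {field!r}")
    inner = text[1:-1].strip()
    return [item.strip() for item in inner.split(",")] if inner else []


def parse_substitution(event):
    """Turn "T->C" into the pair (original base, new base)."""
    before, after = event.split("->")
    return before, after


def parse_insertion(event):
    """Turn "6->TGGGCG" into (assumed length, inserted bases)."""
    length, bases = event.split("->")
    return int(length), bases


# Pair each deletion position with its assumed length, using sequence 3 of the table.
positions = [int(p) for p in parse_bracket_list("[211,261]")]
lengths = [int(n) for n in parse_bracket_list("[19,1]")]
print(list(zip(positions, lengths)))
print(parse_insertion(parse_bracket_list("[6->TGGGCG]")[0]))
```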
After reading the input file (or generating a synthetic dataset), indie performs a deterministic alignment of each sequence against the V template indicated in the suitably formatted input file.
The alignment is done through the standard Needleman-Wunsch (NW) algorithm, and the results are stored in a new folder called greedy. The model parameters used for the initialization can be found in greedy/initial_params.txt, while the ones resulting from the deterministic alignment are in greedy/greedy_params.txt. More information about the best-scoring alignment for each sequence can be found in greedy/greedy_alignment_summary.csv.
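For readers unfamiliar with it, the Needleman-Wunsch algorithm can be sketched in a few lines of Python. This is a textbook version with a linear gap penalty; the scores (match, mismatch, gap) are illustrative assumptions and do not reflect the scoring actually used by indie.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)

    def pair_score(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the dynamic-programming table of best prefix scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


# Toy example: template first, read second; a gap in the read marks a deletion.
best, template_row, read_row = needleman_wunsch("ACGT", "AGT")
print(best)
print(template_row)
print(read_row)
```

With the template as first argument and the read as second, a gap in the read row corresponds to a deletion and a gap in the template row to an insertion.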
Once the deterministic annotation of the Ig dataset has been performed, its results should be compared with those inferred within the probabilistic framework provided by indie. To this aim, the flag -i has to be provided when invoking ./indie. The maximum number of inference steps, -T, and the learning rate alpha, -a, can also be changed from the command line.
Inference results are stored in a dedicated folder, inference. Again, the initial and final values of the model parameters can be found in inference/initial_params.txt and inference/inferred_params.txt, respectively.
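After a run, it can be useful to check that the expected result files are present. The Python sketch below is a hypothetical helper; it assumes that the greedy and inference folders are created inside the batch directory given with -b, which follows from the description of that flag but should be verified on an actual run.

```python
from pathlib import Path

# File names taken from the text above; their location under the batch directory is an assumption.
EXPECTED_OUTPUTS = [
    "greedy/initial_params.txt",
    "greedy/greedy_params.txt",
    "greedy/greedy_alignment_summary.csv",
    "inference/initial_params.txt",
    "inference/inferred_params.txt",
]


def missing_outputs(batch_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected output files that are not present under batch_dir."""
    root = Path(batch_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


# Example: report what is missing in a batch directory named wk_dir.
print(missing_outputs("wk_dir"))
```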
Using the -l flag, which requests a detailed step-by-step report of the inference, adds further files, inference/aux_infer_file_attempt_1.txt, inference/aux_posterior_file_attempt_1.txt and inference/aux_prior_file_attempt_1.txt, with extra information about the values of the model parameters and of the repertoire and sequence-specific point mutation distributions at each inference step. The integer at the end of each file name is the number of times the inference procedure has been launched: depending on the dataset, it may be necessary to decrease the initial learning rate and restart the inference a few times.
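The interplay between the learning rate, the maximum number of iterations and the restarts can be illustrated on a toy problem. The Python sketch below is not indie's inference: it maximizes a simple one-dimensional objective by gradient ascent, and the restart rule (halve the learning rate whenever the objective decreases) is an assumption chosen for illustration.

```python
def gradient_ascent_with_restarts(objective, gradient, x0, alpha, max_iter,
                                  shrink=0.5, max_attempts=10, tol=1e-12):
    """Maximize objective from x0; returns (x, alpha_used, attempt)."""
    for attempt in range(1, max_attempts + 1):
        x, best = x0, objective(x0)
        diverged = False
        for _ in range(max_iter):
            x_new = x + alpha * gradient(x)
            value = objective(x_new)
            if value < best:
                diverged = True  # step too large: the objective got worse
                break
            converged = value - best < tol
            x, best = x_new, value
            if converged:
                break
        if not diverged:
            return x, alpha, attempt
        alpha *= shrink  # restart from x0 with a smaller learning rate
    raise RuntimeError("no stable learning rate found")


# Toy concave objective with its maximum at x = 3.
objective = lambda x: -(x - 3.0) ** 2
gradient = lambda x: -2.0 * (x - 3.0)

x_opt, alpha_used, attempts = gradient_ascent_with_restarts(
    objective, gradient, x0=0.0, alpha=1.5, max_iter=200)
print(x_opt, alpha_used, attempts)
```

In this toy case the first attempt overshoots and is abandoned, and the second attempt, run with half the learning rate, converges.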