Scripts and data files for "Encoded lysines at the beginning of coding sequences enhance translation via their A-enriched codons and side chain"
Scripts are under the CeCILL 2.1 licence which is a free software licence, explicitly compatible with the GNU GPL.
Contacts
- Claire Toffano-Nioche ([email protected])
- Daniel Gautheret ([email protected])
- Jean Lehmann ([email protected])
Clone (git clone [email protected]:i2bc/n_terminal_lysine
) or donwload this archive.
In a n_terminal_lysine
repository, you may have:
- KRNHYIaa_GCA_000146045.2.pdf : the graph of some aa frequencies by position for Saccharomyces cerevisiae S288C R64 strain
- aaFreqByPos.R: R script to create a graph
- compute.sh: compute measurements and create summary table (isoelectric point, CG%, Plong, Pavg, and P2nd)
- conda_env_Rbase_n_terminal_lysine.yml: recipe to create a conda environment and run R script
- conda_env_compute_n_terminal_lysine.yml: recipe to create a conda environment and run shell script
- get_x_nt.awk: get x fisrt nucleotides for each sequence of a multiple fasta file (used in compute.sh)
- prepare_graphs.sh: prepare a shell file to create graph
- ratio_for_?codons.awk: compute specific ratios from positional frequencies of codon (ratio with 2 codon frequencies eg. AGA/AGG ; R ratios: AGA/(AGA+AGG) and (AGA+AGG)/(AGA+AGG+CGA+CGC+CGG+CGT) ; K ratios: AAA/AAG and AAA/(AAA+AGG) ) with a minimal threshold for the denominator
- transpose.awk: matrix transposition (used in compute.sh)
- README.md (this file)
- species_Archaea.list, species_Bacteria.list, species_Eukaryotes.list, and bact_9.list (lists of cdna_sequences used in the article)
In order to increase the reproducibility of the computational analyses we worked with conda environements (see the yml files to create them, conda_env_*_n_terminal_lysine.yml
). Otherwise, softwares emboss (version 6.0) and R (version 4.3.1, including dplyr and ggplot2 libraries) are needed.
We download the 200617_TEMPURA.csv, a table containing optimal growth temperature for some archaea and bacteria species.
bacteria and archaea species were selected when they have both a Topt_average and an Assembly_or_asseccion values in the 200617_TEMPURA.csv and a corresponding cDNA files (*cdna_from_genomics files.fasta.gz
) to download from the ncbi ftp site.
eukaryota selection: downloaded in may 2023 from an ncbi assembly query with eukaryotes[organism] AND “reference genome"[RefSeq Category] added with green alga, african frog and fission yeast. Downloaded cDNA files versions in may 2023 (*_cds_from_genomic.fna.gz) are:
GCA_000001215.4_Release_6_plus_ISO1_MT # Drosophila melanogaster 6 # fruit fly
GCA_000001735.2_TAIR10.1 # Arabidopsis thaliana TAIR10 # thale cress
GCA_000002595.3_Chlamydomonas_reinhardtii_v5.5 # Chlamydomonas reinhardtii v5.5 # green alga
GCA_000002945.2_ASM294v2_SchPom # Schizosaccharomyces pombe ASM294v2 # fission yeast
GCA_000002985.3_WBcel235 # Caenorhabditis elegans cel235 # worm
GCA_000146045.2_R64SacCer # Saccharomyces cerevisiae S288C # baker's yeast
GCF_000001405.40_GRCh38.p14 # Homo sapiens GRCh38.p14 # human
GCF_000001635.27_GRCm39 # Mus musculus GRCm39 # house mouse
GCF_000002035.6_GRCz11 # Danio rerio GRCz11 # zebrafish
GCF_017654675.1_J_2021 # Xenopus laevis # African frog
Run first the compute.sh
script to get a summary table, the prepare_graphs.sh
script to create a shell script adapted to your data, and next this last shell script to obtains one frequencies amino-acids graph by species.
Run n_terminal_lysine/compute.sh
to get a summary table, *.allAA_Plong_geeceeCDS_Toptave_iep_sp.tsv
This table contains 83 columns:
- GCA_ID: the ncbi GCA identifier
- X_P2nd, X_Pavg, X_Plong: 78 values for the 3 measurements to characterize a peak for each letter of the alphabet standing for aa
- geeceeCDS: mean of the GC% computed on the CDS aa sequences
- tempura_Topt_ave: optimal growth temperature extracted from the TEMPURA database when available
- iepCDS: mean of the iso-electric points computed on the CDS aa sequences
- species_name: name of the species
Run n_terminal_lysine/prepare_graphs.sh
(and run the resulting file, *KRNHYIaa_aaFreqByPos_graphes.sh
) to get graphes (Tmp/KRNHYIaa_*.pdf
and Tmp/KRNHYIaa_*.png
) for each species with the K, R, N, H, Y, and I aa. KRNHYIaa_GCA_000146045.2.pdf
is an example for the the Saccharomyces cerevisiae S288C R64 strain.
Go to the n_terminal_lysine
repository.
Download data (TEMPURA DB, select data for 2 bacteria, 2 archaea, 2 eukaryota):
wget http://togodb.org/release/200617_TEMPURA.csv
mkdir -p ncbi_gca ; cd ncbi_gca ;
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/305/GCA_000007305.1_ASM730v1/GCA_000007305.1_ASM730v1_cds_from_genomic.fna.gz ;
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/085/GCA_000007085.1_ASM708v1/GCA_000007085.1_ASM708v1_cds_from_genomic.fna.gz ;
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/009/965/GCA_000009965.1_ASM996v1/GCA_000009965.1_ASM996v1_cds_from_genomic.fna.gz ;
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/010/305/GCA_000010305.1_ASM1030v1/GCA_000010305.1_ASM1030v1_cds_from_genomic.fna.gz ;
wget ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_cds_from_genomic.fna.gz ;
wget ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/945/GCA_000002945.2_ASM294v2/GCA_000002945.2_ASM294v2_cds_from_genomic.fna.gz ;
cd .. ;
Create 3 files that lists selected archaea, bacteria and eukaryota species with genbank identifier in column 1 and species name in column 2 (see species_Archaea.list
, species_Bacteria.list
and species_Eukaryota.list
files for the lists used in the article).
echo GCA_000007085.1$'\t'Caldanaerobacter subterraneus subsp. tengcongensis MB4$'\n'GCA_000010305.1$'\t'Gemmatimonas aurantiaca T-27 > bact.list ;
echo GCA_000007305.1$'\t'Pyrococcus furiosus DSM 3638$'\n'GCA_000009965.1$'\t'Thermococcus kodakarensis KOD1 > arch.list ;
echo GCA_000146045.2$'\t'Saccharomyces cerevisiae S288C R64$'\n'GCA_000002945.2$'\t'Schizosaccharomyces pombe > euka.list ;
You may have:
.
├── 200617_TEMPURA.csv
├── arch.list
├── bact.list
├── euka.list
├── ncbi_gca
│ ├── GCA_000002945.2_ASM294v2_cds_from_genomic.fna.gz
| ├── GCA_000007085.1_ASM708v1_cds_from_genomic.fna.gz
│ ├── GCA_000007305.1_ASM730v1_cds_from_genomic.fna.gz
│ ├── GCA_000009965.1_ASM996v1_cds_from_genomic.fna.gz
│ ├── GCA_000010305.1_ASM1030v1_cds_from_genomic.fna.gz
│ └── GCA_000146045.2_R64_cds_from_genomic.fna.gz
└── ...
Compute measurements:
Activate the conda environment: conda activate emboss_ce
For eukaryota, end with 0 in place of 11 for archaea and bacteria (genetic code, -table
option of the emboss::transeq
tool).
for sp in arch bact ; do bash compute.sh ${sp}.list ../n_terminal_lysine 11 ; done
for sp in euka ; do bash compute.sh ${sp}.list ../n_terminal_lysine 0 ; done
which generates:
.
├── ...
├── arch.list.allAA_Plong_geeceeCDS_Toptave_iep_sp.tsv
├── bact.list.allAA_Plong_geeceeCDS_Toptave_iep_sp.tsv
├── euka.list.allAA_Plong_geeceeCDS_Toptave_iep_sp.tsv
└── Tmp
├── arch.list.geeceeCDS.tsv
├── arch.list.iepCDS.tsv
├── arch.list.P2nd_Pavg_Plong.tsv
├── arch.list.tempura_Topt_ave.tsv
├── bact.list.geeceeCDS.tsv
├── bact...
├── GCA_000002945.2_60nt_cfg.fna
├── GCA_000002945.2_60nt.faa
├── GCA_000002945.2_allAA.csv
├── GCA_000002945.2.geecee
├── GCA_000002945.2.iep
├── GCA_000002945.2.prophecy
├── GCA_000002945.2.transeq
├── GCA_000007085.1_60nt_cfg.fna
├── ...
└── GCA_000146045.2.transeq
The *.allAA_Plong_geeceeCDS_Toptave_iep_sp.tsv
are the resulting tables.
The specific ratios of positional frequencies of codon are computed for 9 selected bacterial species (listed in bact_9.list
):
(ratio with 2 codon frequencies eg. AGA/AGG ; R ratios: AGA/(AGA+AGG) and (AGA+AGG)/(AGA+AGG+CGA+CGC+CGG+CGT) ; K ratios: AAA/AAG and AAA/(AAA+AGG) ) with a minimal threshold for the denominator
Create graphs for amino acids K, R, N, H, Y, and I frequencies on the first 20 protein residus:
Activate the conda environment: conda activate Rgraph_n_terminal_lysine
The prepare_graphs.sh
script prepares a shell script, *KRNHYIaa_aaFreqByPos_graphes.sh
, for each list of species. And when it is run, it will create one graph for each species (need Rscript):
for i in arch bact euka ; do
bash prepare_graphs.sh $i.list ../n_terminal_lysine 20 ;
bash $i.list.KRNHYIaa_aaFreqByPos_graphes.sh ;
done
which generates:
.
├── ...
├── arch.list.KRNHYIaa_aaFreqByPos_graphes.sh
├── bact.list.KRNHYIaa_aaFreqByPos_graphes.sh
├── euka.list.KRNHYIaa_aaFreqByPos_graphes.sh
├── ...
└── Tmp
├── ...
├── KRNHYIaa_GCA_000002945.2.csv
├── KRNHYIaa_GCA_000002945.2.pdf
├── KRNHYIaa_GCA_000002945.2.png
├── ...
└── KRNHYIaa_GCA_000146045.2.png
Resulting graphs stand in the Tmp
repository (pdf
and png
formats).
The graph for the Saccharomyces cerevisiae S288C R64 strain, KRNHYIaa_GCA_000146045.2.pdf
, is supplied as an example of the correct operation of the codes.