Showing
138 changed files
with
18,730 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
# How does ORFold work?
|
||
ORFold is a tool developed in Python 3 that aims at
characterizing the fold potential of a set of amino acid sequences without any
knowledge of their 3D structures or evolutionary information (orphan sequences can therefore be treated).
The fold potential of a sequence is
calculated with the HCA
method. ORFold can also estimate the disorder
and aggregation propensities of the input sequences with IUPred
and Tango, respectively.
|
||
|
||
## Hydrophobic Clusters Analysis (HCA) | ||
**HCA**[1] aims at delineating, in an amino acid sequence, regions enriched in
strong hydrophobic residues (HCA clusters) and regions
of at least four consecutive non-hydrophobic residues (HCA linkers).
The patterns of hydrophobic residues can be associated with specific regular
secondary structures, and the distribution of the HCA clusters and linkers in a protein
sequence can be used to estimate, through the HCA score, its ability to fold (completely or partially).
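
The linker rule described above can be illustrated with a minimal sketch. It only locates runs of at least four consecutive non-hydrophobic residues; it is not pyHCA's implementation, whose cluster delineation is more elaborate. The residue set `STRONG_HYDROPHOBIC` (V, I, L, F, M, Y, W) and the function name `find_linkers` are assumptions made for this illustration.

```python
# Assumption: the strong hydrophobic alphabet commonly used in HCA.
STRONG_HYDROPHOBIC = set("VILFMYW")


def find_linkers(sequence, min_len=4):
    """Return half-open (start, end) spans of >= min_len consecutive
    non-hydrophobic residues (hypothetical helper, not part of pyHCA)."""
    linkers = []
    run_start = None  # index where the current non-hydrophobic run began
    for i, residue in enumerate(sequence):
        if residue in STRONG_HYDROPHOBIC:
            # A hydrophobic residue closes the current run, if any.
            if run_start is not None and i - run_start >= min_len:
                linkers.append((run_start, i))
            run_start = None
        elif run_start is None:
            run_start = i
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(sequence) - run_start >= min_len:
        linkers.append((run_start, len(sequence)))
    return linkers
```

Everything outside the returned spans would then be a candidate region for hydrophobic clusters, under the simplifying assumptions stated above.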
|
||
This score ranges from -10 to +10. Low HCA scores indicate
sequences depleted in hydrophobic clusters and expected to be disordered in solution,
while high HCA scores reflect sequences enriched in hydrophobic clusters
and expected to form aggregates in solution, though some of them could
fold in lipid environments.
Foldable sequences are known to display
a balance between hydrophobic and hydrophilic residues (on average, 33%
of the residues in globular proteins are hydrophobic).
They are mostly associated with intermediate HCA score values.
|
||
|
||
The HCA score is calculated using the freely available | ||
software **pyHCA** which can be downloaded and installed | ||
following the instructions of its developers: <https://github.com/T-B-F/pyHCA> | ||
|
||
|
||
## Tango | ||
**Tango**[5][6][7] is a method that predicts aggregation-nucleating regions
in protein sequences.
If requested by the user, ORFold can calculate the aggregation propensity
of each sequence and add it to the output.
Tango is not freely available software, and ORFold users must
first contact the Tango developers to obtain access to the source code: <http://tango.crg.es>
|
||
For the aggregation propensity estimation, according to the protocol
proposed by XXX et al[REF], a residue is considered as
participating in an aggregation-prone region if it is located in a segment
of at least five consecutive residues, each predicted to populate
a β-aggregated conformation for more than 5%.
The aggregation propensity of each sequence is then defined as the
fraction of residues predicted to lie in aggregation-prone segments.
|
||
## IUPred | ||
**IUPred2A**[2][3][4] is a widely used method for the prediction of
Intrinsically Disordered Proteins (IDPs) and can be used as a
complement to the HCA score prediction.
If requested by the user, ORFold can calculate the disorder propensity
of each sequence and add it to the output.
IUPred is not freely available, and ORFold users should
first contact the IUPred developers to
obtain access to the source code: <https://iupred2a.elte.hu>
|
||
For the disorder propensity estimation, in order to be consistent with
the estimation of the aggregation propensity, ORFold searches for
regions of the protein sequence that contain at least five consecutive
residues with a disorder probability higher than 0.5.
The disorder propensity of each sequence is defined as the fraction
of residues located in such highly disordered segments.
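
The aggregation and disorder propensities share the same definition: the fraction of residues lying in segments of at least five consecutive residues whose per-residue score exceeds a threshold (5% for Tango, 0.5 for IUPred). The sketch below illustrates that calculation on a plain list of per-residue scores. The function name `propensity` and its parameters are hypothetical, and the code does not reproduce how ORFold parses the Tango or IUPred outputs.

```python
def propensity(scores, threshold, min_len=5):
    """Fraction of residues located in runs of at least min_len consecutive
    scores strictly above threshold (illustrative helper, not ORFold's code)."""
    if not scores:
        return 0.0
    in_segments = 0  # residues counted in qualifying segments
    run = 0          # length of the current run above the threshold
    for value in scores:
        if value > threshold:
            run += 1
        else:
            if run >= min_len:
                in_segments += run
            run = 0
    # Account for a run that reaches the end of the sequence.
    if run >= min_len:
        in_segments += run
    return in_segments / len(scores)
```

With this sketch, a disorder propensity would be obtained as `propensity(iupred_scores, 0.5)` and an aggregation propensity as `propensity(tango_percentages, 5.0)`, where both input lists are assumed to hold one value per residue.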
|
||
|
||
|
||
<br><br><br> | ||
#### References | ||
|
||
1. Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018). | ||
2. Dosztányi, Z., Csizmok, V., Tompa, P. & Simon, I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of molecular biology 347, 827–839 (2005).
3. Dosztányi, Z. Prediction of protein disorder based on IUPred. Protein Science 27, 331–340 (2018).
4. Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018). | ||
5. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J. & Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature biotechnology 22, 1302–1306 (2004). | ||
6. Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004). | ||
7. Rousseau, F., Schymkowitz, J. & Serrano, L. Protein aggregation and amyloidosis: confusion of the kinds? Current opinion in structural biology 16, 118–126 (2006). |
@@ -0,0 +1,25 @@ | ||
# Aims and general description of ORFold | ||
|
||
ORFold aims at estimating the fold potential of a set of amino acid sequences
using the **Hydrophobic Clusters Analysis (HCA)** method [1].
We define the foldability of an amino acid sequence as its ability
to fold into a stable 3D structure or into a molten globule state, in which the specific
tertiary structure is lost whereas the secondary structures remain intact.
|
||
ORFold calculates the HCA foldability score of each given sequence. | ||
Furthermore, ORFold can estimate the | ||
disorder and/or aggregation propensities of the input sequences using | ||
the **IUPred**[2][3][4] and **Tango**[5][6][7] methods respectively. | ||
|
||
<br><br><br> | ||
#### References | ||
|
||
1. Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018). | ||
2. Dosztányi, Z., Csizmok, V., Tompa, P. & Simon, I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of molecular biology 347, 827–839 (2005).
3. Dosztányi, Z. Prediction of protein disorder based on IUPred. Protein Science 27, 331–340 (2018).
4. Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018). | ||
5. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J. & Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature biotechnology 22, 1302–1306 (2004). | ||
6. Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004). | ||
7. Rousseau, F., Schymkowitz, J. & Serrano, L. Protein aggregation and amyloidosis: confusion of the kinds? Current opinion in structural biology 16, 118–126 (2006). | ||
|
@@ -0,0 +1,41 @@ | ||
# Plot of the ORFold output | ||
|
||
|
||
The output table generated by ORFold can subsequently be given to ORFplot
to generate a plot of the HCA score distribution.
The user can provide several tables in order to compare different HCA score
distributions. In this case, ORFplot will draw all the distributions on the same plot
(the tables must be given with the **-tab** option).
By default, the names used in the legend of the resulting plot
are the root names of the input table files.
However, the user can provide their own legend names
with the **-names** option. The names must be given in the same order
as the table files.
|
||
orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab | ||
|
||
This example will generate the HCA score distributions of the sequences | ||
stored in the sequences_Y.tab, sequences_X.tab and sequences_Z.tab files. | ||
The resulting legend will be sequences_Y, sequences_X, and sequences_Z respectively. | ||
|
||
orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab -names Noncoding Coding Translated | ||
|
||
This example will generate the HCA score distributions of the sequences | ||
stored in the sequences_Y.tab, sequences_X.tab and sequences_Z.tab files. | ||
The resulting legend will be Noncoding, Coding and Translated, respectively. | ||
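
The legend behaviour described in the two examples above can be sketched as follows: without custom names, each label is the root name of the table file; with custom names, they are paired with the tables in order. The function `legend_names` is a hypothetical illustration and is not taken from the ORFplot source.

```python
from pathlib import Path


def legend_names(table_paths, names=None):
    """Return one legend label per table (illustrative, not ORFplot's code)."""
    if names is None:
        # Default: file name without directory and without extension.
        return [Path(path).stem for path in table_paths]
    if len(names) != len(table_paths):
        raise ValueError("expected one legend name per table file")
    # Custom names are matched to the tables by position.
    return list(names)
```

Raising an error on a length mismatch is a design choice of this sketch; how ORFplot itself reacts to a wrong number of names is not stated here.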
|
||
<div class="admonition note"> | ||
<p class="first admonition-title"> | ||
Note | ||
</p> | ||
<p class="last"> | ||
If the names consist of single words, the user can write them
one after the other as shown in the example above. However, if the user
wishes to use multiple words in the legend labels (e.g. Noncoding sequences -
Homo sapiens, Coding sequences - Homo sapiens, Translated sequences -
Homo sapiens), they must be enclosed in double quotes.
```{} | ||
orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab -names "Noncoding sequences - Homo sapiens" "Coding sequences - Homo sapiens" "Translated sequences - Homo sapiens" | ||
``` | ||
</p> | ||
</div> |
@@ -0,0 +1,81 @@ | ||
## Running ORFold: | ||
|
||
|
||
### Inputs | ||
ORFold requires only a FASTA file containing the amino acid sequences
to process (given with the **-fna** option). ORFold can handle several FASTA files at the same
time. In this case, it processes them independently and generates one
output per input FASTA file.
|
||
|
||
FASTA file example: | ||
```{} | ||
>aminoacid_sequence_1 | ||
AGNVCFGGRTYMPDFDGMSCVNWQERT | ||
>aminoacid_sequence_2 | ||
MPDFMPCNVSDRTEEEPMSPARTYDFGHKLCVSDFTPMLKKPERT | ||
``` | ||
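
For readers who want to inspect such a file before running ORFold, here is a minimal FASTA reader. It is a generic sketch and not ORFold's own parser; the function name `read_fasta` is hypothetical, and taking the first word of the header line as the identifier is an assumption.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Assumption: the identifier is the first word of the header.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

It can be used on an open file, for example `dict(read_fasta(open("sequences.fasta")))`, where `sequences.fasta` is the input file name used in the commands of this page.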
|
||
|
||
### How to estimate the fold potential and/or disorder and aggregation propensities | ||
By default, ORFold only estimates the fold potential of the input sequences.
However, the disorder and aggregation propensities can be calculated as well.
The user can specify which calculation methods to run with
the **-options** argument.
|
||
Each method used by ORFold is referred to by its initial:
<pre> | ||
HCA : H | ||
IUPred : I | ||
TANGO : T | ||
</pre> | ||
|
||
The user must specify the combination of methods to apply
to the input sequences by giving their initials with the **-options** argument,
without any space: ```-options HIT``` to run all three programs or ```-options HT```
to run only HCA and Tango, for example.
The order of the letters does not matter:
```-options HIT``` and ```-options THI``` lead to the same result.
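
The order-independence of the **-options** string follows naturally if the letters are treated as a set. The sketch below shows one way to interpret such a string; `METHODS` and `parse_options` are hypothetical names, and rejecting unknown letters is an assumption rather than documented ORFold behaviour.

```python
# Initials as listed above, mapped to the method they select.
METHODS = {"H": "HCA", "I": "IUPred", "T": "Tango"}


def parse_options(options="H"):
    """Return the set of methods selected by an -options string.

    The default mirrors the documented default (HCA only).
    """
    letters = set(options)  # a set discards order and duplicates
    unknown = letters - set(METHODS)
    if unknown:
        raise ValueError("unknown method initial(s): " + "".join(sorted(unknown)))
    return {METHODS[letter] for letter in letters}
```

Under these assumptions, `parse_options("HIT")` and `parse_options("THI")` return the same set.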
|
||
|
||
### Basic run | ||
The following command estimates the fold potential, as well as the disorder and aggregation propensities, of
all amino acid sequences contained in the input FASTA file:
|
||
```{bash} | ||
orfold -fna sequences.fasta -options HIT | ||
``` | ||
|
||
|
||
|
||
Note that **IUPred** and **Tango** provide additional information
to HCA but will considerably slow down ORFold on large datasets.
The next command only calculates the fold potential with HCA:
```{bash} | ||
orfold -fna sequences.fasta -options H | ||
``` | ||
|
||
### Output: | ||
ORFold produces a table (fasta_rootname.tab) that contains, for each input sequence,
the computed values (tab-separated) requested by the user (fold potential and/or disorder and/or
aggregation propensities).
|
||
|
||
|
||
|
||
|
||
Output file example with ```-options HIT``` (fold potential, disorder and aggregation | ||
propensities estimated from HCA, IUPred and Tango, see [here](./How_it_works_orfold.md) for more details): | ||
|
||
```{} | ||
Seq_ID HCA Disord Aggreg | ||
aminoacid_sequence_1 1.340 0.000 0.230 | ||
aminoacid_sequence_2 -0.230 0.120 0.012 | ||
``` | ||
|
||
Output file example with ```-options H``` (fold potential estimated with HCA, see [here](./How_it_works_orfold.md) for more details): | ||
```{} | ||
Seq_ID HCA Disord Aggreg | ||
aminoacid_sequence_1 1.340 nan nan | ||
aminoacid_sequence_2 -0.230 nan nan | ||
``` |
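
A table like the ones above can be loaded for further analysis with a few lines of Python. The sketch below is not part of ORFold: the function name `read_orfold_table` is hypothetical, and it assumes that the first row is the header, that fields are separated by whitespace (tabs included), and that sequence identifiers contain no spaces. Python's `float` converts the string `nan` to a NaN value, so methods that were not run remain distinguishable.

```python
def read_orfold_table(lines):
    """Map each sequence ID to a {column name: float} dictionary."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return {}
    header, records = rows[0], rows[1:]
    return {
        # fields[0] is the sequence ID; the remaining fields are numeric.
        fields[0]: dict(zip(header[1:], map(float, fields[1:])))
        for fields in records
    }
```

For example, `read_orfold_table(open("sequences.tab"))` would return a dictionary keyed by sequence ID, where `sequences.tab` stands for the table produced from `sequences.fasta`.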