Add docs to main
nchenche committed Dec 18, 2020
1 parent bbd0335 commit f745863
Showing 138 changed files with 18,730 additions and 0 deletions.
Binary file added docs/.DS_Store
Binary file not shown.
81 changes: 81 additions & 0 deletions docs/How_it_works_orfold.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# How does ORFold work?

ORFold is a tool developed in python3 that aims to
characterize the fold potential of a set of amino acid sequences without
any knowledge of their 3D structures or evolutionary information (orphan sequences can therefore be treated).
The fold potential of a sequence is
calculated with the HCA
method. ORFold can also estimate the disorder
and aggregation propensities of the input sequences with IUPred
and Tango, respectively.


## Hydrophobic Clusters Analysis (HCA)
**HCA**[1] aims at delineating, in an amino acid sequence, regions enriched in
strong hydrophobic residues (HCA clusters) and regions
of at least four consecutive non-hydrophobic residues (HCA linkers).
The patterns of hydrophobic residues can be associated with specific regular
secondary structures, and the distribution of the HCA clusters and linkers in a protein
sequence can be used to estimate through the HCA score, its ability to fold (completely or partially).
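
As a rough illustration of the segmentation described above, the following sketch splits a sequence into clusters and linkers. It is not the pyHCA implementation: the strong hydrophobic alphabet (`VILFMYW`), the function name `segment_hca`, and the one-dimensional treatment (no helical net, no special handling of proline) are simplifying assumptions made for this example only.

```python
import re

# Assumption: strong hydrophobic residues as commonly used in HCA (V, I, L, F, M, Y, W).
STRONG_HYDROPHOBIC = set("VILFMYW")


def segment_hca(sequence, min_linker=4):
    """Return (clusters, linkers) as lists of half-open (start, end) index pairs.

    A linker is a run of at least `min_linker` consecutive non-hydrophobic
    residues. A cluster starts and ends with a hydrophobic residue and
    contains no run of `min_linker` or more non-hydrophobic residues.
    Short non-hydrophobic runs at the sequence ends belong to neither.
    """
    # Encode the sequence as a binary string: 1 = hydrophobic, 0 = other.
    code = "".join("1" if aa in STRONG_HYDROPHOBIC else "0" for aa in sequence.upper())
    linkers = [m.span() for m in re.finditer(rf"0{{{min_linker},}}", code)]
    clusters = [m.span() for m in re.finditer(rf"1(?:0{{0,{min_linker - 1}}}1)*", code)]
    return clusters, linkers


clusters, linkers = segment_hca("LAALGGGGSV")
print(clusters, linkers)
```

The real HCA score additionally depends on how these clusters and linkers are distributed along the sequence, which this sketch does not attempt to reproduce.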

This score ranges from -10 to +10. Low HCA scores indicate
sequences depleted in hydrophobic clusters and expected to be disordered in solution,
while high HCA scores reflect sequences enriched in hydrophobic clusters
and expected to form aggregates in solution, though some of them could
fold in lipid environments.
Foldable sequences are known to display
an equilibrium between hydrophobic and hydrophilic residues (average of 33%
of hydrophobic residues in globular proteins).
They are mostly associated with intermediate HCA score values.


The HCA score is calculated using the freely available
software **pyHCA** which can be downloaded and installed
following the instructions of its developers: <https://github.com/T-B-F/pyHCA>


## Tango
**Tango**[5][6][7] is a method which aims at predicting aggregation nucleating regions
in protein sequences.
If specified by the user, ORFold can calculate and add the aggregation propensity
of a sequence in the output.
Tango is not freely available software, and the user of ORFold should
first contact the Tango developers to have access to the source code: <http://tango.crg.es>

For the aggregation propensity estimation, according to the protocol
proposed by XXX et al[REF], a residue is considered as
participating in an aggregation-prone region if it is located in a segment
of at least five consecutive residues, each predicted to populate
a β-aggregated conformation for more than 5%.
Then, the aggregation propensity of each sequence is defined as the
fraction of residues predicted in aggregation prone segments.
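
The rule above can be sketched as follows. This is a minimal illustration and not ORFold's own code: the function name `aggregation_propensity` and the input format (one β-aggregation percentage per residue, as a plain list) are assumptions; parsing the actual Tango output is left out.

```python
def aggregation_propensity(beta_percent, threshold=5.0, min_len=5):
    """Fraction of residues lying in aggregation-prone segments.

    `beta_percent` holds one beta-aggregation value (in %) per residue
    (hypothetical input format). A segment counts only if it contains at
    least `min_len` consecutive residues strictly above `threshold`.
    """
    n_residues = len(beta_percent)
    if n_residues == 0:
        return 0.0
    residues_in_segments = 0
    run_length = 0
    # The trailing None acts as a sentinel that closes a run ending at the last residue.
    for value in list(beta_percent) + [None]:
        if value is not None and value > threshold:
            run_length += 1
        else:
            if run_length >= min_len:
                residues_in_segments += run_length
            run_length = 0
    return residues_in_segments / n_residues


# One run of five residues above 5% is counted; the run of two is ignored.
scores = [0, 10, 10, 10, 10, 10, 0, 10, 10, 0]
print(aggregation_propensity(scores))
```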

## IUPred
**IUPred2A**[2][3][4] is one of the best methods for the prediction of
Intrinsically Disordered Proteins (IDPs) and can be used as a
complement to the HCA score prediction.
If specified by the user, ORFold can calculate and add the disorder propensity
of a sequence in the output.
IUPred is not freely available, and the user of
ORFold should first contact the IUPred developers to
have access to the source code: <https://iupred2a.elte.hu>

For the disorder propensity estimation, in order to be consistent with
the estimation of the aggregation propensity, ORFold searches for
regions on the protein sequence that present at least five consecutive
residues with a disorder probability higher than 0.5.
The disorder propensity of each sequence is defined as the fraction
of residues predicted as located in a highly disordered segment.
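
A possible way to express this computation is shown below, here written with `itertools.groupby`. The function name `disorder_propensity` and the input (a list of per-residue disorder probabilities) are assumptions for illustration; how ORFold reads the IUPred output is not covered.

```python
from itertools import groupby


def disorder_propensity(probabilities, cutoff=0.5, min_len=5):
    """Fraction of residues located in highly disordered segments.

    A segment is kept only if it spans at least `min_len` consecutive
    residues whose disorder probability is strictly higher than `cutoff`.
    """
    probabilities = list(probabilities)
    if not probabilities:
        return 0.0
    covered = 0
    # groupby yields maximal runs of residues on the same side of the cutoff.
    for is_disordered, run in groupby(probabilities, key=lambda p: p > cutoff):
        length = sum(1 for _ in run)
        if is_disordered and length >= min_len:
            covered += length
    return covered / len(probabilities)


# Six disordered residues form a valid segment; the final two do not.
profile = [0.9] * 6 + [0.1] * 4 + [0.8] * 2
print(disorder_propensity(profile))
```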



<br><br><br>
#### References

1. Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).
2. Dosztányi, Z., Csizmok, V., Tompa, P. & Simon, I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of molecular biology 347, 827–839 (2005).
3. Dosztányi, Z. Prediction of protein disorder based on IUPred. Protein Science 27, 331–340 (2018).
4. Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018).
5. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J. & Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature biotechnology 22, 1302–1306 (2004).
6. Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004).
7. Rousseau, F., Schymkowitz, J. & Serrano, L. Protein aggregation and amyloidosis: confusion of the kinds? Current opinion in structural biology 16, 118–126 (2006).
25 changes: 25 additions & 0 deletions docs/Objective_orfold.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
![LOGO_ORFold](./img/icons/Logo_ORFold.eps){ width=30% }
# Aims and general description of ORFold

ORFold aims at estimating the fold potential of a set of amino acid sequences
using the **Hydrophobic Clusters Analysis (HCA)** method [1].
We define the foldability of an amino acid sequence as its ability
to fold to a stable 3D structure or to a molten globule state in which the specific
tertiary structure is lost whereas the secondary structures are intact.

ORFold calculates the HCA foldability score of each given sequence.
Furthermore, ORFold can estimate the
disorder and/or aggregation propensities of the input sequences using
the **IUPred**[2][3][4] and **Tango**[5][6][7] methods respectively.

<br><br><br>
#### References

1. Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).
2. Dosztányi, Z., Csizmok, V., Tompa, P. & Simon, I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of molecular biology 347, 827–839 (2005).
3. Dosztányi, Z. Prediction of protein disorder based on IUPred. Protein Science 27, 331–340 (2018).
4. Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018).
5. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J. & Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature biotechnology 22, 1302–1306 (2004).
6. Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004).
7. Rousseau, F., Schymkowitz, J. & Serrano, L. Protein aggregation and amyloidosis: confusion of the kinds? Current opinion in structural biology 16, 118–126 (2006).

41 changes: 41 additions & 0 deletions docs/Plot_orfold.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# Plot of the ORFold output


The output table generated by ORFold can subsequently be given to ORFplot
to generate a plot of the HCA score distribution.
The user can provide several tables in order to compare different HCA score
distributions. In this case, ORFplot will draw all the distributions on the same plot
(the tables must be given with the **-tab** option).
By default, the names used in the legend of the resulting plot
are the root names of the input table files.
However, the user can supply their own legend names
with the **-names** option. The names must be given in the same order
as the table files.

    orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab

This example will generate the HCA score distributions of the sequences
stored in the sequences_Y.tab, sequences_X.tab and sequences_Z.tab files.
The resulting legend will be sequences_Y, sequences_X, and sequences_Z respectively.

    orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab -names Noncoding Coding Translated

This example will generate the HCA score distributions of the sequences
stored in the sequences_Y.tab, sequences_X.tab and sequences_Z.tab files.
The resulting legend will be Noncoding, Coding and Translated, respectively.
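
The legend behaviour shown in these two examples can be summarised by the small sketch below. It is only an illustration of the rule (root names by default, user names in matching order otherwise); the function name `legend_labels` is hypothetical and this is not ORFplot's actual code.

```python
from pathlib import Path


def legend_labels(tables, names=None):
    """Return one legend label per table file.

    Without `names`, the root name of each file is used (file name without
    directory and extension). With `names`, the labels are taken in the
    same order as the tables, so both lists must have the same length.
    """
    if names is None:
        return [Path(table).stem for table in tables]
    if len(names) != len(tables):
        raise ValueError("one name is required per table file")
    return list(names)


tables = ["sequences_Y.tab", "sequences_X.tab", "sequences_Z.tab"]
print(legend_labels(tables))
print(legend_labels(tables, ["Noncoding", "Coding", "Translated"]))
```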

<div class="admonition note">
<p class="first admonition-title">
Note
</p>
<p class="last">
If the names consist of single words, the user can write them
one after the other as shown in the example above. However, if the user
wishes to use multiple words in the legend labels (e.g. Noncoding sequences -
Homo sapiens, Coding sequences - Homo sapiens, Translated sequences -
Homo sapiens), each label must be enclosed in double quotes.
```{}
orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab -names "Noncoding sequences - Homo sapiens" "Coding sequences - Homo sapiens" "Translated sequences - Homo sapiens"
```
</p>
</div>
81 changes: 81 additions & 0 deletions docs/Run_orfold.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
## Running ORFold:


### Inputs
ORFold requires only a FASTA file containing the amino acid sequences
to treat (given with the **-fna** option). ORFold can handle several FASTA files at the same
time. In this case, it treats them independently and generates one
output per input FASTA file.


FASTA file example:
```{}
>aminoacid_sequence_1
AGNVCFGGRTYMPDFDGMSCVNWQERT
>aminoacid_sequence_2
MPDFMPCNVSDRTEEEPMSPARTYDFGHKLCVSDFTPMLKKPERT
```
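
For readers unfamiliar with the format, the sketch below shows how such a file maps to identifiers and sequences. It is a minimal reader written for illustration (the function name `read_fasta` is hypothetical) and does not reflect how ORFold itself parses its input.

```python
def read_fasta(lines):
    """Map each sequence identifier to its amino acid sequence.

    `lines` is any iterable of text lines, for example an open file.
    The identifier is the first word after '>' on a header line, and
    sequences split over several lines are concatenated.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


example = """>aminoacid_sequence_1
AGNVCFGGRTYMPDFDGMSCVNWQERT
>aminoacid_sequence_2
MPDFMPCNVSDRTEEEPMSPARTYDFGHKLCVSDFTPMLKKPERT
"""
sequences = read_fasta(example.splitlines())
print(list(sequences))
```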


### How to estimate the fold potential and/or disorder and aggregation propensities
By default, ORFold only estimates the fold potential of the input sequences.
The disorder and aggregation propensities can, however, be calculated as well.
The user can specify which calculation methods are to be launched with
the **-options** argument.

Each method used by ORFold is referred to by its initial:
<pre>
HCA : H
IUPred : I
TANGO : T
</pre>

The user must specify the combination of methods to apply
to the input sequences by giving their initials to the **-options** argument
without any space: ```-options HIT``` for running the 3 programs or ```-options HT```
to run only HCA and Tango, for example.
The order of the letters does not matter:
```-options HIT``` and ```-options THI``` will lead to the same result.
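
The order independence follows naturally if the letters are read as a set of methods. The sketch below illustrates that idea; the names `METHODS` and `parse_options` are hypothetical, and rejecting unknown letters is an assumption about reasonable behaviour rather than a description of ORFold's code.

```python
# Initials as listed above, mapped to the method they select.
METHODS = {"H": "HCA", "I": "IUPred", "T": "Tango"}


def parse_options(options):
    """Turn a string such as 'HIT' into the set of selected methods."""
    selected = set()
    for letter in options:
        if letter not in METHODS:
            raise ValueError(f"unknown method initial: {letter!r}")
        selected.add(METHODS[letter])
    return selected


# Both spellings select the same three methods.
print(parse_options("HIT") == parse_options("THI"))
print(sorted(parse_options("HT")))
```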


### Basic run
The following instruction estimates the fold potential and the disorder and aggregation propensities of
all amino acid sequences contained in the input FASTA file:

```{bash}
orfold -fna sequences.fasta -options HIT
```



Note that **IUPred** and **Tango** provide additional information
to HCA but will considerably slow down ORFold for large datasets.
The next instruction calculates only the fold potential with HCA:
```{bash}
orfold -fna sequences.fasta -options H
```

### Output:
ORFold produces a table (fasta_rootname.tab) that contains, for each input sequence,
the computed values (separated by tabs) according to the user request (fold potential and/or disorder and/or
aggregation propensities).





Output file example with ```-options HIT``` (fold potential, disorder and aggregation
propensities estimated from HCA, IUPred and Tango, see [here](./How_it_works_orfold.md) for more details):

```{}
Seq_ID HCA Disord Aggreg
aminoacid_sequence_1 1.340 0.000 0.230
aminoacid_sequence_2 -0.230 0.120 0.012
```

Output file example with ```-options H``` (fold potential estimated with HCA, see [here](./How_it_works_orfold.md) for more details):
```{}
Seq_ID HCA Disord Aggreg
aminoacid_sequence_1 1.340 nan nan
aminoacid_sequence_2 -0.230 nan nan
```
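
Such a table can be loaded for downstream analysis with a few lines of Python. The sketch below assumes a tab-separated file with the column names shown in the examples (`Seq_ID`, `HCA`, `Disord`, `Aggreg`); the function name `read_orfold_table` is hypothetical and not part of ORFold.

```python
import csv
import io
import math


def read_orfold_table(handle):
    """Read an ORFold-style table into {Seq_ID: {column: float}}.

    Values written as 'nan' (methods that were not run) become float NaN.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        seq_id = row.pop("Seq_ID")
        table[seq_id] = {column: float(value) for column, value in row.items()}
    return table


# In-memory stand-in for a file produced with "-options H".
example = io.StringIO(
    "Seq_ID\tHCA\tDisord\tAggreg\n"
    "aminoacid_sequence_1\t1.340\tnan\tnan\n"
    "aminoacid_sequence_2\t-0.230\tnan\tnan\n"
)
table = read_orfold_table(example)
print(table["aminoacid_sequence_1"]["HCA"])
print(math.isnan(table["aminoacid_sequence_1"]["Disord"]))
```

With a real file, `open("fasta_rootname.tab", newline="")` would replace the `io.StringIO` object.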