Add docs to main

i2bc · Dec 18, 2020 · f745863 · f745863
1 parent bbd0335
commit f745863
Showing 138 changed files with 18,730 additions and 0 deletions.
diff --git a/docs/.DS_Store b/docs/.DS_Store
diff --git a/docs/How_it_works_orfold.md b/docs/How_it_works_orfold.md
@@ -0,0 +1,81 @@
+# How does ORFold work?
+
+ORFold is a tool developed in Python 3 that aims at
+characterizing the fold potential of a set of amino acid sequences without any
+knowledge of their 3D structures or evolutionary information (orphan sequences can therefore be treated).
+The fold potential of a sequence is
+calculated with the HCA
+method. ORFold can also estimate the disorder
+and aggregation propensities of the input sequences with IUPred
+and Tango, respectively.
+
+
+## Hydrophobic Clusters Analysis (HCA)
+**HCA**[1] aims at delineating, in an amino acid sequence, regions enriched in
+strongly hydrophobic residues (HCA clusters) and regions
+of at least four consecutive non-hydrophobic residues (HCA linkers).
+The patterns of hydrophobic residues can be associated with specific regular
+secondary structures, and the distribution of the HCA clusters and linkers in a protein
+sequence can be used to estimate, through the HCA score, its ability to fold (completely or partially).
+
+This score ranges from -10 to +10. Low HCA scores indicate
+sequences depleted in hydrophobic clusters and expected to be disordered in solution,
+while high HCA scores reflect sequences enriched in hydrophobic clusters
+and expected to form aggregates in solution, though some of them could
+fold in lipid environments.
+Foldable sequences are known to display
+an equilibrium between hydrophobic and hydrophilic residues (average of 33% 
+of hydrophobic residues in globular proteins). 
+They are mostly associated with intermediate HCA score values.
+
+
+The HCA score is calculated using the freely available 
+software **pyHCA** which can be downloaded and installed 
+following the instructions of its developers: <https://github.com/T-B-F/pyHCA>
+
+
+## Tango
+**Tango**[5][6][7] is a method that predicts aggregation-nucleating regions
+in protein sequences.
+If specified by the user, ORFold can calculate the aggregation propensity
+of each sequence and add it to the output.
+Tango is not freely available software, and ORFold users must
+first contact the Tango developers to obtain access to the source code: <http://tango.crg.es>
+
+For the aggregation propensity estimation, according to the protocol
+proposed by XXX et al[REF], a residue is considered as
+participating in an aggregation-prone region if it is located in a segment
+of at least five consecutive residues that are each predicted to populate
+a β-aggregated conformation for more than 5%.
+Then, the aggregation propensity of each sequence is defined as the
+fraction of residues predicted in aggregation-prone segments.
+
+## IUPred
+**IUPred2A**[2][3][4] is one of the best methods for the prediction of
+Intrinsically Disordered Proteins (IDPs) and can be used as a
+complement to the HCA score prediction.
+If specified by the user, ORFold can calculate the disorder propensity
+of each sequence and add it to the output.
+IUPred is not freely available, and ORFold users
+must first contact the IUPred developers to
+obtain access to the source code: <https://iupred2a.elte.hu>
+
+For the disorder propensity estimation, in order to be consistent with
+the estimation of the aggregation propensity, ORFold searches for
+regions of the protein sequence that contain at least five consecutive
+residues with a disorder probability higher than 0.5.
+The disorder propensity of each sequence is defined as the fraction
+of residues located in such highly disordered segments.
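The aggregation and disorder propensities described above follow the same rule: count the residues that belong to a run of at least five consecutive residues scoring above a threshold, then divide by the sequence length. The following is a minimal sketch of that rule, not ORFold's actual implementation. The function name `segment_fraction` and the example scores are hypothetical, and it assumes per-residue scores are already available as a list (disorder probabilities with a threshold of 0.5, or Tango aggregation percentages with a threshold of 5).

```python
from typing import Sequence


def segment_fraction(scores: Sequence[float], threshold: float, min_len: int = 5) -> float:
    """Fraction of residues located in runs of at least `min_len` consecutive
    residues whose score is strictly above `threshold` (hypothetical helper)."""
    if not scores:
        return 0.0
    in_segment = 0  # residues belonging to a qualifying run
    run = 0         # length of the run currently being scanned
    for value in scores:
        if value > threshold:
            run += 1
        else:
            if run >= min_len:
                in_segment += run
            run = 0
    if run >= min_len:  # a run may extend to the end of the sequence
        in_segment += run
    return in_segment / len(scores)


# Hypothetical per-residue disorder probabilities for a 12-residue sequence:
# residues 2-6 form a run of five above 0.5, residues 8-9 form a run too short to count.
disorder = [0.1, 0.6, 0.7, 0.8, 0.9, 0.6, 0.2, 0.7, 0.7, 0.1, 0.1, 0.1]
print(segment_fraction(disorder, threshold=0.5))
```

With these made-up scores, five of the twelve residues fall in a qualifying segment, so the propensity is 5/12.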
+
+
+
+<br><br><br>
+#### References
+
+1. Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).
+2. Dosztányi, Z., Csizmok, V., Tompa, P. & Simon, I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of molecular biology 347, 827–839 (2005).
+3. Dosztányi, Z. Prediction of protein disorder based on IUPred. Protein Science 27, 331–340 (2018).
+4. Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018).
+5. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J. & Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature biotechnology 22, 1302–1306 (2004).
+6. Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004).
+7. Rousseau, F., Schymkowitz, J. & Serrano, L. Protein aggregation and amyloidosis: confusion of the kinds? Current opinion in structural biology 16, 118–126 (2006).
diff --git a/docs/Objective_orfold.md b/docs/Objective_orfold.md
@@ -0,0 +1,25 @@
+![LOGO_ORFold](./img/icons/Logo_ORFold.eps){ width=30% }
+# Aims and general description of ORFold
+
+ORFold aims at estimating the fold potential of a set of amino acid sequences
+using the **Hydrophobic Clusters Analysis (HCA)** method [1]. 
+We define the foldability of an amino acid sequence as its ability 
+to fold to a stable 3D structure or to a molten globule state in which the specific 
+tertiary structure is lost whereas the secondary structures are intact.
+
+ORFold calculates the HCA foldability score of each given sequence. 
+Furthermore, ORFold can estimate the 
+disorder and/or aggregation propensities of the input sequences using 
+the **IUPred**[2][3][4] and **Tango**[5][6][7] methods respectively.
+
+<br><br><br>
+#### References
+
+1. Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).
+2. Dosztányi, Z., Csizmok, V., Tompa, P. & Simon, I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of molecular biology 347, 827–839 (2005).
+3. Dosztányi, Z. Prediction of protein disorder based on IUPred. Protein Science 27, 331–340 (2018).
+4. Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018).
+5. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J. & Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature biotechnology 22, 1302–1306 (2004).
+6. Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004).
+7. Rousseau, F., Schymkowitz, J. & Serrano, L. Protein aggregation and amyloidosis: confusion of the kinds? Current opinion in structural biology 16, 118–126 (2006).
+
diff --git a/docs/Plot_orfold.md b/docs/Plot_orfold.md
@@ -0,0 +1,41 @@
+# Plot of the ORFold output
+
+
+The output table generated by ORFold can subsequently be given to ORFplot
+to generate a plot of the HCA score distribution.
+The user can provide several tables in order to compare different HCA score
+distributions. In this case, ORFplot will draw all the distributions on the same plot
+(the tables must be given with the **-tab** option).
+By default, the names used in the legend of the resulting plot
+are the root names of the input table files.
+However, the user can provide custom legend names
+with the **-names** option. The names must be given in the same order
+as the table files.
+
+	orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab
+
+This example will generate the HCA score distributions of the sequences 
+stored in the sequences_Y.tab, sequences_X.tab and sequences_Z.tab files. 
+The resulting legend will be sequences_Y, sequences_X, and sequences_Z respectively. 
+
+	orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab -names Noncoding Coding Translated
+
+This example will generate the HCA score distributions of the sequences 
+stored in the sequences_Y.tab, sequences_X.tab and sequences_Z.tab files.
+The resulting legend will be Noncoding, Coding and Translated, respectively.
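The legend rule described above (root names by default, custom names in the same order as the tables otherwise) can be sketched as follows. This is an illustration only, not ORFplot's code. The function name `legend_labels` is hypothetical, and raising an error when the number of names differs from the number of tables is an assumption, since the documentation does not say how ORFplot handles that case.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def legend_labels(tables: Sequence[str], names: Optional[Sequence[str]] = None) -> List[str]:
    """Return one legend label per table file (hypothetical helper)."""
    if names is None:
        # Default: root name of each table file, e.g. sequences_Y.tab -> sequences_Y
        return [Path(table).stem for table in tables]
    if len(names) != len(tables):
        # Assumption: a mismatch is treated as an error in this sketch.
        raise ValueError("expected one name per table file")
    return list(names)


tables = ["sequences_Y.tab", "sequences_X.tab", "sequences_Z.tab"]
print(legend_labels(tables))
print(legend_labels(tables, ["Noncoding", "Coding", "Translated"]))
```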
+
+<div class="admonition note">
+    <p class="first admonition-title">
+        Note
+    </p>
+    <p class="last">
        If the names consist of single words, the user can write them
one after the other as shown in the example above. However, if the user
wishes to use multiple words in the legend labels (e.g. Noncoding sequences -
Homo sapiens, Coding sequences - Homo sapiens, Translated sequences -
Homo sapiens), each label must be enclosed in double quotes.
+```{}
+	orfplot -tab sequences_Y.tab sequences_X.tab sequences_Z.tab -names "Noncoding sequences - Homo sapiens" "Coding sequences - Homo sapiens" "Translated sequences - Homo sapiens"
+```
+    </p>
+</div>
diff --git a/docs/Run_orfold.md b/docs/Run_orfold.md
@@ -0,0 +1,81 @@
+## Running ORFold:
+
+
+### Inputs
+ORFold requires only a FASTA file containing the amino acid sequences
+to treat (given with the **-fna** option). ORFold can handle several FASTA files at the same
+time. In this case, it treats them independently and generates one
+output per input FASTA file.
+
+
+ FASTA file example:
+```{}
+>aminoacid_sequence_1
+AGNVCFGGRTYMPDFDGMSCVNWQERT
+>aminoacid_sequence_2
+MPDFMPCNVSDRTEEEPMSPARTYDFGHKLCVSDFTPMLKKPERT
+```
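For readers who want to inspect such an input file before running ORFold, the sketch below reads FASTA records into a dictionary. It is a generic illustration and not the parser used by ORFold. The function name `read_fasta` is hypothetical, and it assumes every record starts with a `>` header line followed by one or more sequence lines.

```python
import io
from typing import Dict, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    sequences: Dict[str, str] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line  # sequences may span several lines
    return sequences


example = io.StringIO(
    ">aminoacid_sequence_1\n"
    "AGNVCFGGRTYMPDFDGMSCVNWQERT\n"
    ">aminoacid_sequence_2\n"
    "MPDFMPCNVSDRTEEEPMSPARTYDFGHKLCVSDFTPMLKKPERT\n"
)
records = read_fasta(example)
print(list(records))
```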
+
+
+### How to estimate the fold potential and/or disorder and aggregation propensities
+By default, ORFold only estimates the fold potential of the input sequences.
+The disorder and aggregation propensities can, however, be calculated as well.
+The user can specify which calculation methods are to be launched with
+the **-options** argument.
+
+Each method used by ORFold is referred to by its initial:
+<pre>
+   HCA     : H
+   IUPred  : I
+   TANGO   : T 
+</pre>
+
+The user must specify the combination of methods to apply
+to the input sequences by giving their initials with the **-options** argument,
+without any space: ```-options HIT``` for running the 3 programs or ```-options HT```
+for running only HCA and Tango, for example.
+The order of the letters does not matter:
+```-options HIT``` and ```-options THI``` lead to the same result.
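The order-independence of the letters can be pictured by mapping the option string onto a set of methods. The sketch below is only an illustration of that idea and not ORFold's own argument handling. The names `METHODS` and `parse_options` are hypothetical, and rejecting unknown letters with an error is an assumption.

```python
from typing import Set

# Initials as listed in the documentation.
METHODS = {"H": "HCA", "I": "IUPred", "T": "Tango"}


def parse_options(letters: str) -> Set[str]:
    """Translate a string such as 'HIT' into the set of methods to run."""
    unknown = set(letters) - set(METHODS)
    if unknown:
        # Assumption: unknown initials are rejected in this sketch.
        raise ValueError("unknown method initial(s): " + ", ".join(sorted(unknown)))
    return {METHODS[letter] for letter in letters}


print(parse_options("HIT") == parse_options("THI"))
print(sorted(parse_options("HT")))
```

Because the result is a set, `HIT` and `THI` select exactly the same methods.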
+
+
+### Basic run
+The following command estimates the fold potential, and the disorder and aggregation propensities, of
+all amino acid sequences contained in the input FASTA file:
+
+```{bash}
+orfold -fna sequences.fasta -options HIT
+```
+
+
+
+Note that **IUPred** and **Tango** provide additional information
+to HCA but will considerably slow down ORFold for large datasets.
+The next command calculates only the fold potential with HCA:
+```{bash}
+orfold -fna sequences.fasta -options H
+```
+
+### Output:
+ORFold produces a table (fasta_rootname.tab) that contains, for each input sequence,
+the computed values (separated by tabs) according to the user request (fold potential, and/or disorder and/or
+aggregation propensities).
+
+
+
+
+
+ Output file example with ```-options HIT``` (fold potential, disorder and aggregation
+ propensities estimated from HCA, IUPred and Tango, see [here](./How_it_works_orfold.md) for more details):
+
+```{}
+Seq_ID                  HCA     Disord  Aggreg
+aminoacid_sequence_1	1.340	0.000	0.230	
+aminoacid_sequence_2	-0.230	0.120	0.012	
+```
+
+Output file example with ```-options H``` (fold potential estimated with HCA, see [here](./How_it_works_orfold.md) for more details):
+```{}
+Seq_ID                  HCA     Disord  Aggreg
+aminoacid_sequence_1    1.340   nan     nan
+aminoacid_sequence_2    -0.230  nan     nan
+```
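The output table can be loaded for downstream analysis with a few lines of Python. The sketch below is a minimal reader and not part of ORFold. The function name `read_orfold_table` is hypothetical, and it assumes that columns are separated by whitespace, that the first line is the header shown above, and that sequence identifiers contain no spaces. Methods that were not run appear as `nan`, which Python's `float` parses directly.

```python
import io
import math
from typing import Dict, TextIO


def read_orfold_table(handle: TextIO) -> Dict[str, Dict[str, float]]:
    """Return {Seq_ID: {column name: value}} from an ORFold-style table."""
    columns = handle.readline().split()[1:]  # drop the Seq_ID column name
    table: Dict[str, Dict[str, float]] = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        table[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return table


# Content copied from the -options H example above.
example = io.StringIO(
    "Seq_ID\tHCA\tDisord\tAggreg\n"
    "aminoacid_sequence_1\t1.340\tnan\tnan\n"
    "aminoacid_sequence_2\t-0.230\tnan\tnan\n"
)
table = read_orfold_table(example)
print(table["aminoacid_sequence_2"]["HCA"])
print(math.isnan(table["aminoacid_sequence_1"]["Disord"]))
```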