GitHub - fredpdavis/homolobind: proteome-wide prediction of protein and ligand binding sites

fredpdavis / homolobind Public
Notifications You must be signed in to change notification settings
Fork 0
Star 1
proteome-wide prediction of protein and ligand binding sites
Notifications
Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
doc		doc
examples		examples
src		src
COPYING		COPYING
README		README
Repository files navigation

----------------------------------------------------------------------------
HOMOLOBIND: Proteome-wide prediction of protein and ligand binding sites
            using structure

v1.1. September 16, 2010.  http://fredpdavis.com/homolobind
Copyright 2009,2010  Fred P. Davis
----------------------------------------------------------------------------

NOTE: HOMOLOBIND is maintained here for archiving purposes. The databases
that HOMOLOBIND depends on have dramatically changed or stopped development.

HOMOLOBIND predictions for several species are archived at Zenodo:
https://zenodo.org/record/29597


o What is HOMOLOBIND
---------------------

HOMOLOBIND identifies residues in protein sequences with significant
similarity to structurally characterized binding sites. It transfers
binding sites from LIGBASE (http://salilab.org/ligbase) and PIBASE
(http://fredpdavis.com/pibase) through ASTRAL/ASTEROIDS
(http://astral.berkeley.edu) alignments onto SUPERFAMILY
(http://supfam.org) domain assignments.

The homology transfer procedure predicts residues in ligand and protein
binding sites with estimated true positive rates of 98% and 88%,
respectively, at 1% false positive rates.

This release is based on the SCOP v1.73 domain classification.


o Documentation
---------------

 docs/homolobind_user_manual.pdf offers detailed instruction on
 installation and usage. This README is meant to offer a quick guide.

 examples/README.examples describes commands to recreate Fig. 4 and 5
 in the accompanying manuscript. The directory includes input files and
 expected output files.

 HOMOLOBIND is released under the GPL v3 license.


o Installing HOMOLOBIND
-----------------------

 Running HOMOLOBIND requires Perl, the Bit::Vector CPAN module, and wget,
 tar, and gunzip commands to retrieve external data files (~520 MB).
 Optionally, R and the RSPerl interface are required to calculate the
 significance of overlap between predicted ligand and protein binding
 sites.

 1. Add the full path of the src/perl_api directory to your PERL5LIB
    environment variable.

 2. Edit the homolobind.pm specs section (line 70) to set the directory
    where data files should be stored.

 3. If you want to run the program in parallel on an SGE-based computing
    cluster, edit the homolobind.pm specs section to:

    * specify the hostname from where jobs can be submitted (line 64)
    * the number of jobs to launch (line 65)
    * qstat polling frequency (line 66)

 4. Run: 'perl homolobind.pl -fetch_data 1' to retrieve data files from
      the ASTRAL, ASTEROIDS, HOMOLOBIND, PIBASE, and SCOP websites.

   This requires ~530MB of free space.

 5. Download SUPERFAMILY self_hits files.
   http://pibase.janelia.org/download/homolobind/v1.1/self_hits_1.73.tar.gz

   * Uncompress the file in the 'superfamily' subdirectory of the data
     directory specified in Step 2.

    mv self_hits.tar.gz HOMOLOBIND_DATA_DIRECTORY/superfamily/
    cd HOMOLOBIND_DATA_DIRECTORY/superfamily/
    tar xvfz self_hits.tar.gz


o Preparing input for HOMOLOBIND
--------------------------------

 HOMOLOBIND takes as input a list of SUPERFAMILY domain assignments.
 There are 2 ways to get these assignments:

  (i) Run SUPERFAMILY v1.73 software locally to assign domains to your sequences
      http://supfam.org/SUPERFAMILY/downloads.html

      NOTE: This option currently only works with v1.73 SUPERFAMILY software.
      SUPERFAMILY has recently (Nov 2010) updated to SCOP v1.75 domain
      definitions, and the corresponding update for the HOMOLOBIND binding
      site library will be available by the end of 2010.

 (ii) Get precomputed genomic domain assignments by installing SUPERFAMILY
      MySQL tables and querying for your species of interest.
      Three tables are required: align, ass, and family.

   http://pibase.janelia.org/download/homolobind/v1.1/align_01-Nov-2009.sql.gz (4 GB)
   http://pibase.janelia.org/download/homolobind/v1.1/ass_01-Nov-2009.sql.gz (318 MB)
   http://pibase.janelia.org/download/homolobind/v1.1/family_01-Nov-2009.sql.gz (167 MB)

   To query for all human domain assignments, run:

   mysql> SELECT ass.genome, ass.seqid, ass.model, ass.region, ass.evalue,
   align.alignment, family.evalue, family.px, family.fa FROM ass, align,
   family WHERE ass.genome = 'hs' AND ass.auto = align.auto AND
   ass.auto = family.auto ORDER BY ass.model, family.fa ;


o Running HOMOLOBIND
--------------------

1. Annotating binding site similarities.

 USAGE: homolobind.pl -ass_fn SUPERFAMILY_assignment_file
  [-cluster_fl 1] to run on an SGE cluster
  [-out_fn output_file] [-err_fn error_file]
  [-matrix_fn substitution_matrix_file] matblas format substitution matrix

 For example: 

  perl homolobind.pl -ass_fn hs.ass -out_fn hs.out -err_fn hs.err

  This command would process the domain assignments listed in
   `hs.ass' and list all detected binding site similarities 
   in the `hs.out' file, with errors reported in `hs.err'.

2. Creating a summary table from HOMOLOBIND output file.

  perl homolobind.pl -summarize_results hs.out

   This command summarizes the annotation results: numbers of proteins,
   domains, and residues covered by domain, peptide, and ligand binding
   sites. This command also counts predicted bifunctional residues with
   significant similarity to both ligand and protein binding sites.

3. Create a diagram depiciting predictions and domain architecture.

  perl homolobind.pl -plot_annotations 1 -ass_fn ASSIGNMENT_FILE
       -results_fn HOMOLOBIND_RESULT_FILE -seq_id SEQUENCE_IDENIFIER
       -seq_id_fn SEQUENCE_ID_LIST_FILE

  This command creates a postscript diagrams depiciting the annotated
  binding sites for the sequence specified by -seq_id XX or the
  sequences listed in the file specified by -seq_id_fn.


o HOMOLOBIND output
-------------------

 Each output line describes the similarity of a target domain to a
 structurally characterized binding site. The output is tab-delimited:
 1. Sequence identifier
 2. Domain residue range
 3. SCOP classification level (superfamily or family)
 4. SCOP classification
 5. Template binding site type: p=peptide, P=protein domain, L=ligand;
                                exp=exposed residues.
 6. Template binding site identifier.
 7. Template binding site description
 8. List of residues aligned to template binding site
 9. Percent sequence identity over template binding site residues
10. Percent sequence similarity over template binding site residues; 
     Similarity is a Karlin-Brocchieri normalized and rescaled BLOSUM62 score,
     described below.
11. Number of identical binding site positions
12. Number of aligned binding site positions
13. Number of residues in template binding site
14. Fraction of template binding site residues aligned to the target sequence
15. Number of identical residues across whole domain
16. Length of whole domain alignment
17. Percent sequence identity across whole domain
18. Percent sequence similarity across whole domain

 For those target domains where similar binding sites were not found,
 there will be a line with similar field as above, however in place of
 fields 6-18, one of the following reasons will be provided:

 1. 'no template'

 2. 'sub-threshold template' - template binding sites are available
    in the SCOP family, but are below the sequence identity threshold.

 3. 'domain family not covered by ASTRAL'. ASTRAL/ASTEROIDS only covers
    domains in classes a-g. Classes h-k are not covered.

 4. 'ERROR in merging SUPFAM/ASTRAL alignments'.


 * Sequence similarity score:

   To provide a more graded measure of template--target similarity, a
   sequence similarity score is computed using a normalized version of
   the BLOSUM62 substitution matrix (Henikoff and Hennikof, PNAS 1992)
   as suggested by Karlin and Brocchieri (J Bacteriol 1996) and
   rescaled to range from zero to one:

   sim(aa_i,aa_j) = (BLOSUM62(aa_i,aa_j)) /
                     sqrt(BLOSUM62(aa_i, aa_i) * BLOSUM62(aa_j, aa_j)) + 1) / 2


o Input File Formats
--------------------

 * SUPERFAMILY domain assignment file
   - see test/example_domains.ass for an example

   Each line has the following fields:
   1. species identifier
   2. target sequence id
   3. SUPERFAMILY model_id
   4. target domain residue range
   5. superfamily-level domain assignment e-value
   6. SUPERFAMILY alignment string
   7. family-level domain assignment e-value
   8. SCOP px_id
   9. SCOP fa_id
 
 * matrix_fn - substitution matrix in matblas format,
   eg, http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt


o Citing HOMOLOBIND
-------------------

 Proteome-wide prediction of overlapping protein and ligand binding sites
   using structure. Davis FP. Molecular Biosystems (2011) 7: 545-557.
   http://dx.doi.org/10.1039/C0MB00200C

 HOMOLOBIND uses data from the following sources:
 * ASTRAL/ASTEROIDS: Chandonia, et al. Nucleic Acids Res (2004) 32:D189-92.
 * LIGBASE: Stuart, et al. Bioinformatics (2002) 8(1):200-1.
 * PIBASE: Davis and Sali. Bioinformatics (2005) 21(9):1901-7.
 * SCOP: Murzin, et al. J Mol Biol (1995) 247(4):536-40.
 * SUPERFAMILY: Wilson, et al. Nucleic Acids Res (2009) 37:D380-6.


o Contact information
---------------------
 Fred P. Davis
 email: [email protected]
 web:   http://fredpdavis.com


This file is part of HOMOLOBIND.

HOMOLOBIND is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

HOMOLOBIND is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with HOMOLOBIND.  If not, see <http://www.gnu.org/licenses/>.

----------------------------------------------------------------------------