Merge pull request #10 from DanielFaulkner/docsupdate

Additional documentation and surplus space replace to detab utility.
DanielFaulkner · Feb 10, 2021 · 5cd7e5a
2 parents 58ad283 + 0d9221b

Showing 5 changed files with 114 additions and 15 deletions.
diff --git a/README.md b/README.md
@@ -7,6 +7,8 @@ Utilities:
 - Annotation Tabulator - Converts a GTF/GFF file from using both tab and semicolon separators to only tab characters.
 - Annotation Features - Compares two annotation files and adds columns detailing overlapping features.
 - Annotation Viewer - Compares two annotation files and displays a genomic representation of their positions.
+- Annotation Sorter - Sorts entries in an annotation file by position.
+- Annotation Atlas - Adds columns from The Human Protein Atlas dataset to an annotation file.
 
 Combined these utilities should enable an annotation file to be filtered and converted into a format suitable for further analysis. Any interesting annotations can then be formatted to include details on overlapping/nearby features or viewed alongside another annotation file. This can be particularly useful when used with a genomic reference sequence annotation file.
 
@@ -17,8 +19,8 @@ The 'doc' folder contains documentation on the use of these utilities and the 't
 **NOTE:** While care has been made to remove errors the outputs of these utilities come with no guarantee of accuracy. Please check the outputs of these utilities are accurate and suitable before using in research or production settings.
 
 ## Requirements
-- Python, version 3 
-- Linux - however other operating systems may work with little or no modification needed. 
+- Python, version 3
+- Linux - however other operating systems may work with little or no modification needed.
 
 ## Author
 

diff --git a/docs/userguide.md b/docs/userguide.md
@@ -8,7 +8,7 @@ The annotation utilities provides a few tools in aid in the manipulation and ins
 - DFAM
 - GTF/GFF
 
-All files must use tab separation and comments and headers must be indicated with a preceeding '#' character. Each utility will try to determine the filetype based on it's file extension or content. If a utility is unable to determine the file type of an input this can be corrected by changing the file extension to either .bed,.csc,.fam or .gtf. Alternatively it maybe possible to indicate this as an argument when starting a utility.
+All files must use tab separation, and comments and headers must be indicated with a preceding '#' character. Each utility will try to determine the file type based on its file extension or content. If a utility is unable to determine the file type of an input, this can be corrected by changing the file extension to either .bed, .csc, .fam or .gtf. Alternatively it may be possible to indicate this as an argument when starting a utility.
 
 **Requirements:**  
 Python (version 3) is required.  
@@ -76,24 +76,74 @@ Tabulating a GTF file so all fields are separated by tab characters.
 
 python3 annotabify.py test/sampledata/GTF.gtf GTFtabulated.gtf
 
-## Annotation overlapping feature (annofeat.py)
+## Annotation sorter (annosort.py)
 
-This utility reads an annotation file and looks for features in a reference annotation file which overlap. These are then added as additional tab separated columns in the output. One column for the feature type and one for the feature ID. A margin can be specified to extend the overlapping region to include neighbouring features. Additionally the output can be configured to return all results or just those with the highest priority, transcripts/exons by default, see comments in libAnnoFeat.py file for instructions on modifying the priority list.
-This utility can compare any two support annotation files but is most likely to be used to compare the results of an analysis against a genome annotation file.
-NOTE: This utility has received very limited optimisation work, therefore expect the utility to take a while when used with larger files.
+This utility checks the order of annotation entries within a file. It can either report the sort status of the file, using the -s/--status option, or sort the file if an output file is provided with the -o/--output option. Entries are grouped by chromosome, in alphabetical order, and sorted by alignment start position.
+
+**Arguments:**  
+Input filename 			(required, filepath)  
+-o/--output:	Output file name to use.  
+-s/--status:	View the current sort status of the annotation file.  
+
+**Example:**  
+Sort an annotation file.
+
+python3 annosort.py test/sampledata/GTF.gtf -o sortedfile.gtf
+
+View the sort status of an annotation file.
+
+python3 annosort.py test/sampledata/GTF.gtf -s
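
Reviewer note: the sort order described above (grouped by chromosome alphabetically, then ordered by start position) can be sketched as follows. This is a minimal illustration and not the annosort.py implementation; the function names and the `chrom_col`/`start_col` parameters are hypothetical, with defaults chosen to match the GTF column layout.

```python
def entry_key(line, chrom_col=0, start_col=3):
    """Sort key: chromosome name (compared as text), then integer start position."""
    fields = line.rstrip("\n").split("\t")
    return (fields[chrom_col], int(fields[start_col]))


def split_header(lines):
    """Separate '#' comment/header lines from data lines, dropping blank lines."""
    header = [l for l in lines if l.startswith("#")]
    body = [l for l in lines if l.strip() and not l.startswith("#")]
    return header, body


def is_sorted(lines, chrom_col=0, start_col=3):
    """Report whether the data lines are already in sorted order."""
    _, body = split_header(lines)
    keys = [entry_key(l, chrom_col, start_col) for l in body]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def sort_entries(lines, chrom_col=0, start_col=3):
    """Return header lines followed by data lines in sorted order."""
    header, body = split_header(lines)
    return header + sorted(body, key=lambda l: entry_key(l, chrom_col, start_col))
```

Because chromosome names are compared as text, `chr10` sorts before `chr2`; that matches the "alphabetically" wording in the guide but differs from a natural numeric ordering.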
+
+## Annotation closest/overlapping feature (annofeat.py)
+
+This utility reads an annotation file and looks for features in a reference annotation file which are nearby or overlap. These are then added as additional tab separated columns in the output.  
+The utility has two modes. The default detection method returns the closest features preceding and following an annotation, as well as those overlapping it. The features to include when looking for the closest feature can be provided as a comma separated list using the -f/--features option. This mode adds columns to the annotation file for the closest feature type, name, strand and distance from the annotation, in each direction and within.  
+The alternative method, selected with the -o/--overlap option, detects only features overlapping an annotation, although this region can be extended to include neighbouring features with the -m/--margin option. This mode adds columns for overlapping feature types and feature names to the annotation file.  
+The output can be configured to return all results, using the -a/--all option, or just those with the highest priority. By default the highest priority features are transcripts and exons; see the comments in the libAnnoFeat.py file for instructions on how to modify the priority list.
+This utility can compare any two supported annotation files but is most likely to be used to compare the results of an analysis against a genome annotation file.  
+NOTE: This utility will run significantly slower with files which are not sorted by genomic position.  
 
 **Arguments:**  
 Query filename 			(required, filepath)  
 Reference filename  (required, filepath)  
 Output filename 		(required, filepath)  
--m/--margin:  Additional margin (in bps) to include in the search for overlapping features.  
--a/--all:     Returns all overlapping features as a semicolon separated list.  
--t/--title:   Title to use for the column header (if a header line is present).  
+-m/--margin:  Additional margin (in bps) to include in the search for overlapping features (use with -o).  
+-a/--all:       Returns all overlapping features as a semicolon separated list.  
+-t/--title:     Title to use for the column header, if a header line is present (use with -o).  
+-o/--overlap:   Only returns those results which overlap with an annotation.  
+-s/--sense:     Return the results formatted using sense/antisense direction.  
+-f/--features:  Features to include when looking for the closest feature.  
 
 **Example:**  
-Searching for overlapping features with itself, an unrealistic use case.
+Searching a file against itself for overlapping features, with a margin of 10 base pairs. This is an unrealistic use case.
+
+python3 annofeat.py test/sampledata/GTF.gtf test/sampledata/GTF.gtf output.gtf -o -a -t OverlappingPlus10 -m 10
+
+Searching for closest transcripts, exons and UTR features using a hypothetical genome annotation file.
+
+python3 annofeat.py test/sampledata/GTF.gtf genefeatures.gtf output.gtf -a -f transcript,exon,UTR
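
Reviewer note: the two detection modes described for annofeat.py can be illustrated with a small sketch over features on a single chromosome. This is an assumption-laden illustration, not the libAnnoFeat.py code: the function names, the `(start, end, name)` tuple layout and the distance definition (gap between the nearest edges) are all hypothetical.

```python
def overlapping(query, refs, margin=0):
    """Names of reference features overlapping the query, widened by `margin` bases."""
    q_start, q_end = query
    return [name for start, end, name in refs
            if start <= q_end + margin and end >= q_start - margin]


def closest(query, refs):
    """Closest feature before and after the query, plus any overlapping it.

    Distances are measured between the nearest edges (an assumed definition).
    """
    q_start, q_end = query
    before = [(q_start - end, name) for start, end, name in refs if end < q_start]
    after = [(start - q_end, name) for start, end, name in refs if start > q_end]
    return {
        "preceding": min(before) if before else None,
        "within": overlapping(query, refs),
        "following": min(after) if after else None,
    }
```

This sketch scans every reference feature for each query. The guide's note that unsorted files run significantly slower suggests the real utility takes advantage of sorted input, which this linear scan does not attempt.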
 
-python3 annofeat.py test/sampledata/GTF.gtf test/sampledata/GTF.gtf output.gtf -a -t OverlappingPlus10 -m 10
+## Annotation atlas (annoatlas.py)
+
+This utility adds columns from a Human Protein Atlas dataset to an annotation file. If the annotation file format does not include gene names as standard, the column containing them can be defined manually using the -c/--column option, with columns numbered from 0. The columns to add from the Human Protein Atlas dataset can be defined using the -a/--atlascols option. This option supports a comma separated list of column numbers or column names or, with the -r/--regex option, regular expression search terms.  
+NOTE: Only the TSV formatted complete dataset is compatible. The gene name, synonym and Ensembl ID columns are used for detecting a match with an annotation file entry.
+
+**Arguments:**  
+Input filename 			(required, filepath)  
+Atlas filename 			(required, filepath)  
+Output filename 		(required, filepath)  
+-c/--column:    Annotation file column containing gene names.  
+-r/--regex:	    Use regular expression search terms for atlas columns.  
+-a/--atlascols: Human Protein Atlas columns to add to the annotation file.
+
+**Example:**  
+Add specific columns to an annotation file.
+
+python3 annoatlas.py test/sampledata/atlasInput.bed proteinatlas.tsv combinedoutput.tsv -c 6 -a 6,7,Chromosome
+
+Add columns containing the word RNA using a regular expression search term.
+
+python3 annoatlas.py test/sampledata/atlasInput.bed proteinatlas.tsv combinedoutput.tsv -c 6 -a RNA -r
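
Reviewer note: one way the -a/--atlascols value could be resolved into column indices is sketched below. This is a guess at the behaviour described in the guide, not the annoatlas.py code; `select_columns` and its parameters are hypothetical, and in regex mode every term is treated as a pattern, which is a simplification.

```python
import re


def select_columns(header, spec, regex=False):
    """Resolve a comma separated column spec into a list of column indices.

    Terms may be column numbers or exact column names; with regex=True each
    term is used as a regular expression searched against every column name.
    """
    selected = []
    for term in spec.split(","):
        term = term.strip()
        if regex:
            matches = [i for i, name in enumerate(header) if re.search(term, name)]
        elif term.isdigit():
            matches = [int(term)]
        else:
            matches = [i for i, name in enumerate(header) if name == term]
        for i in matches:
            if i not in selected:   # keep first occurrence, preserve order
                selected.append(i)
    return selected
```

For example, with a header of `["Gene", "Chromosome", "RNA tissue specificity"]`, the spec `"0,Chromosome"` selects columns 0 and 1, while `"RNA"` with `regex=True` selects column 2.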
 
 ## Annotation viewer (annoview.py)
 
@@ -133,6 +183,38 @@ query r/l	Moves the query view to the right or left (if lines extend beyond the
 query unedited	Shows the query annotations unedited  
 query edited	Shows the query annotations in a standardised format  
 
+## Example project
+This section walks through a short example project that uses these tools to identify how many annotations have a potentially intact promoter and follow a prognostic protein.  
+Required files:
+- Human Protein Atlas as proteinatlas.tsv
+- NCBI Reference Sequence annotation as hg38.gtf
+- Annotation file as DFAM.tsv
+
+First filter the annotation file to find features with an intact promoter.
+- python3 annofilter.py -i DFAM.tsv -o DFAMfiltered.tsv --minpromoter 100 --defaultend 1000
+
+Convert the file to a GTF file (optional, included here as an example).
+- python3 annoconv.py -i DFAMfiltered.tsv -o GTFfiltered.gtf -c GTF
+
+To speed up comparisons between feature files, ensure the NCBI reference annotation is sorted by genomic position.
+- python3 annosort.py hg38.gtf -o hg38sorted.gtf
+
+Convert the annotation file to a standard TSV format.
+- python3 annotabify.py GTFfiltered.gtf GTFtabulated.gtf
+
+Add information on the closest feature.
+- python3 annofeat.py GTFtabulated.gtf hg38sorted.gtf GTFfeatures.gtf -s
+
+Add columns from the Human Protein Atlas dataset.
+- python3 annoatlas.py GTFfeatures.gtf proteinatlas.tsv GTFatlas.gtf -c 13 -a Pathology -r
+
+Convert back to a standard GTF formatted file.
+- python3 annotabify.py GTFatlas.gtf GTFdetabulated.gtf -r
+
+
+After running these commands the GTF file should contain information on the closest features and relevant prognostic information. This can then be sorted or filtered in spreadsheet software, by opening the GTFatlas.gtf file as a TSV file, or fed into further analysis software.  
+NOTE: When adding columns to annotation files it is important to preserve the header and file extensions if the file is to be used by another utility, so the software recognises which columns contain essential information.
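
Reviewer note: the example project's commands could be chained from a small Python driver so that a failing step stops the run. This is only a sketch: the command lines are copied from the steps above, while `build_pipeline`, `run_pipeline` and the `runner` parameter are hypothetical names, and it assumes the script is started from the repository root with the three required files in place.

```python
import subprocess


def build_pipeline(python="python3"):
    """Return the example project's commands, in order, as argument lists."""
    steps = [
        ["annofilter.py", "-i", "DFAM.tsv", "-o", "DFAMfiltered.tsv",
         "--minpromoter", "100", "--defaultend", "1000"],
        ["annoconv.py", "-i", "DFAMfiltered.tsv", "-o", "GTFfiltered.gtf", "-c", "GTF"],
        ["annosort.py", "hg38.gtf", "-o", "hg38sorted.gtf"],
        ["annotabify.py", "GTFfiltered.gtf", "GTFtabulated.gtf"],
        ["annofeat.py", "GTFtabulated.gtf", "hg38sorted.gtf", "GTFfeatures.gtf", "-s"],
        ["annoatlas.py", "GTFfeatures.gtf", "proteinatlas.tsv", "GTFatlas.gtf",
         "-c", "13", "-a", "Pathology", "-r"],
        ["annotabify.py", "GTFatlas.gtf", "GTFdetabulated.gtf", "-r"],
    ]
    return [[python] + step for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in turn; check=True raises if a step exits non-zero."""
    for cmd in commands:
        runner(cmd, check=True)
```

Calling `run_pipeline(build_pipeline())` runs the seven steps in the documented order. Passing a different `runner` makes it possible to inspect the commands without executing them.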
+
 ## Useful resources
 This section contains links to some useful resources applicable to both using the utilities and on where to find suitable annotation files.  
 
@@ -142,6 +224,7 @@ This section contains links to some useful resources applicable to both using th
 - GTF formatted genome annotation from [UCSC](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/)
 - GTF formatted genome annotation from [NCBI](https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/)
 - UCSC repeat annotation files from [UCSC's table browser website](https://genome.ucsc.edu/cgi-bin/hgTables)
+- The Human Protein Atlas dataset from [The Human Protein Atlas website](https://www.proteinatlas.org/about/download)
 
 **UCSC Table browser settings:**  
 Open Reading Frame (ORF) positions:  

diff --git a/lib/libAnnoFeat.py b/lib/libAnnoFeat.py
@@ -254,11 +254,11 @@ def featureClosestAddColumn(annofileobj, reftrackobj, outfileobj, senseorder=0,
     while line[0]=="#":
         header = line
         line = annofileobj.readline()
-    extracolstmp = "\t{} Name\t{} Type\t{} Strand\t{} Distance\tWithin Name\tWithin Strand\tWithin Type\tWithin Distance\t{} Name\t{} Type\t{} Strand\t{} Distance\n"
+    extracolstmp = "\t{} Name\t{} Type\t{} Strand\t{} Distance\tWithin Name\tWithin Type\tWithin Strand\tWithin Distance\t{} Name\t{} Type\t{} Strand\t{} Distance\n"
     if senseorder:
         extracols = extracolstmp.format("AntiSense","AntiSense","AntiSense","AntiSense","Sense","Sense","Sense","Sense")
     else:
-        extracols = extracolstmp.format("Preceeding","Preceeding","Preceeding","Preceeding","Following","Following","Following","Following")
+        extracols = extracolstmp.format("Preceding","Preceding","Preceding","Preceding","Following","Following","Following","Following")
     newheader = header.strip() + extracols
     outfileobj.write(newheader)    # NOTE: If multiple preceeding comment lines present they will not be preserved using this method
     # For each data line open the annotation position and perform comparison

diff --git a/lib/libAnnoTabify.py b/lib/libAnnoTabify.py
@@ -9,6 +9,7 @@
 standardgtfheader = "#seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tattributes"
 unknownkey = "unknown"
 divider = " "   # GTF Key/Value separator character. Some software outputs '=' but ' ' is standard.
+detabspacechar = "_"    # When using the detab function replace additional spaces with this char
 
 # Principly designed following the GTF specification. (https://mblab.wustl.edu/GTF22.html)
 # But could be adapted to support other file formats in the future.
@@ -76,7 +77,10 @@ def deTabify(infileobj, outfileobj, addheader=0):
     if header:
         header = header.strip()             # Remove any preceeding/following tabs
         if len(header.split('\t'))>8:
-            headings = header.split('\t')[8:]
+            headingsUnmod = header.split('\t')[8:]
+            headings = []
+            for item in headingsUnmod:      # Replace any additional space characters
+                headings.append(item.replace(" ",detabspacechar))
     # Write the header
     if addheader>1:
         newheader = standardgtfheader+'\n'
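
Reviewer note: the loop added to deTabify in this hunk could also be written as a list comprehension. The sketch below is a standalone equivalent for illustration only; `detabspacechar` is taken from the diff, while `attribute_headings` and `first_attr_col` are hypothetical names.

```python
detabspacechar = "_"    # replacement for spaces, as defined in the diff


def attribute_headings(header, first_attr_col=8):
    """Return the headings after the eight fixed GTF columns, with spaces replaced.

    GTF attributes are written as 'key value;' pairs separated by a space, so a
    key containing a space would be ambiguous when the file is read back.
    """
    columns = header.strip().split("\t")
    return [c.replace(" ", detabspacechar) for c in columns[first_attr_col:]]
```

With this, a column headed `Preceding Name` becomes the attribute key `Preceding_Name` when the tabulated file is converted back to GTF.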

diff --git a/test/sampledata/atlasInput.bed b/test/sampledata/atlasInput.bed
@@ -0,0 +1,10 @@
+chr1	50619126	50619902	L1P4a_5end	777	-	FAF1
+chr1	50620657	50621383	L1P4a_5end	604	-	UNKNOWN
+chr1	50621700	50621883	L1P4a_5end	117	-	.
+chr1	50622194	50622527	L1P4a_5end	260	-	DPM1
+chr1	55679656	55680125	L1P4a_5end	391	-	PACE-1
+chr1	56343575	56344988	L1P4a_5end	1082	+	ENSG00000000938
+chr1	57682374	57684857	L1P4a_5end	2334	+	 LAS1L
+chr1	68576796	68577787	L1P4a_5end	636	+	FLJ12525
+chr1	70625362	70625721	L1P4a_5end	301	+	ENSG00000005243
+chr1	70743911	70745541	L1P4a_5end	1314	-	FTDCR1B