Merge pull request #10 from DanielFaulkner/docsupdate
Additional documentation, and surplus space replacement added to the detab utility.
DanielFaulkner authored Feb 10, 2021
2 parents 58ad283 + 0d9221b commit 5cd7e5a
Showing 5 changed files with 114 additions and 15 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -7,6 +7,8 @@ Utilities:
- Annotation Tabulator - Converts a GTF/GFF file from using both tab and semicolon separators to only tab characters.
- Annotation Features - Compares two annotation files and adds columns detailing overlapping features.
- Annotation Viewer - Compares two annotation files and displays a genomic representation of their positions.
- Annotation Sorter - Sorts entries in an annotation file by position.
- Annotation Atlas - Adds columns from The Human Protein Atlas dataset to an annotation file.

Combined, these utilities should enable an annotation file to be filtered and converted into a format suitable for further analysis. Any interesting annotations can then be formatted to include details on overlapping/nearby features or viewed alongside another annotation file. This can be particularly useful when used with a genomic reference sequence annotation file.

@@ -17,8 +19,8 @@ The 'doc' folder contains documentation on the use of these utilities and the 't
**NOTE:** While care has been taken to remove errors, the outputs of these utilities come with no guarantee of accuracy. Please check that the outputs are accurate and suitable before using them in research or production settings.

## Requirements
- Python, version 3
- Linux - however other operating systems may work with little or no modification needed.
- Python, version 3
- Linux - however other operating systems may work with little or no modification needed.

## Author

103 changes: 93 additions & 10 deletions docs/userguide.md
@@ -8,7 +8,7 @@ The annotation utilities provides a few tools in aid in the manipulation and ins
- DFAM
- GTF/GFF

All files must use tab separation and comments and headers must be indicated with a preceeding '#' character. Each utility will try to determine the filetype based on it's file extension or content. If a utility is unable to determine the file type of an input this can be corrected by changing the file extension to either .bed,.csc,.fam or .gtf. Alternatively it maybe possible to indicate this as an argument when starting a utility.
All files must use tab separation, and comments and headers must be indicated with a preceding '#' character. Each utility will try to determine the file type based on its file extension or content. If a utility is unable to determine the file type of an input, this can be corrected by changing the file extension to either .bed, .csc, .fam or .gtf. Alternatively, it may be possible to indicate this as an argument when starting a utility.
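
As a rough illustration of the extension-based part of this detection, the sketch below checks a file name against the four extensions listed above. The function name `guess_filetype` and its `override` parameter are hypothetical, and the utilities' own detection (which can also inspect file content) may work differently.

```python
import os

# Extensions named in this guide; anything else is treated as undetermined.
SUPPORTED_EXTENSIONS = {".bed", ".csc", ".fam", ".gtf"}

def guess_filetype(path, override=None):
    """Return the recognised extension for path, or None if undetermined.

    override is a hypothetical stand-in for passing the file type
    as an argument instead of relying on the file name.
    """
    if override:
        ext = "." + override.lower().lstrip(".")
    else:
        ext = os.path.splitext(path)[1].lower()
    return ext if ext in SUPPORTED_EXTENSIONS else None
```

A file whose type comes back as `None` here would be one to rename or to describe explicitly with an argument.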

**Requirements:**
Python (version 3) is required.
@@ -76,24 +76,74 @@ Tabulating a GTF file so all fields are separated by tab characters.

python3 annotabify.py test/sampledata/GTF.gtf GTFtabulated.gtf

## Annotation overlapping feature (annofeat.py)
## Annotation sorter (annosort.py)

This utility reads an annotation file and looks for features in a reference annotation file which overlap. These are then added as additional tab separated columns in the output. One column for the feature type and one for the feature ID. A margin can be specified to extend the overlapping region to include neighbouring features. Additionally the output can be configured to return all results or just those with the highest priority, transcripts/exons by default, see comments in libAnnoFeat.py file for instructions on modifying the priority list.
This utility can compare any two support annotation files but is most likely to be used to compare the results of an analysis against a genome annotation file.
NOTE: This utility has received very limited optimisation work, therefore expect the utility to take a while when used with larger files.
This utility checks the order of annotation entries within a file. It can either report the sort status of the file, using the -s/--status option, or sort the file when an output file is provided with the -o/--output option. Entries are grouped by chromosome, in alphabetical order, and sorted by alignment start position within each group.

**Arguments:**
Input filename (required, filepath)
-o/--output: Output file name to use.
-s/--status: View the current sort status of the annotation file.

**Example:**
Sort an annotation file.

python3 annosort.py test/sampledata/GTF.gtf -o sortedfile.gtf

View the sort status of an annotation file.

python3 annosort.py test/sampledata/GTF.gtf -s
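
The ordering described above can be sketched in a few lines. This is a minimal illustration rather than the code in annosort.py: the function names are hypothetical, and the default column indices assume tab-split GTF rows (seqname in column 0, start in column 3).

```python
def sort_key(row, chrom_col=0, start_col=3):
    """Key that groups by chromosome name, then orders by numeric start."""
    return (row[chrom_col], int(row[start_col]))

def sort_entries(rows, chrom_col=0, start_col=3):
    """Return rows grouped by chromosome (alphabetically), sorted by start."""
    return sorted(rows, key=lambda r: sort_key(r, chrom_col, start_col))

def is_sorted(rows, chrom_col=0, start_col=3):
    """Report whether rows are already in the order sort_entries produces."""
    keys = [sort_key(r, chrom_col, start_col) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

Because chromosome names are compared as text, `chr10` sorts before `chr2` under this scheme.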

## Annotation closest/overlapping feature (annofeat.py)

This utility reads an annotation file and looks for features in a reference annotation file which are nearby or overlap. These are then added as additional tab separated columns in the output.
The utility has two modes. The default detection method returns the closest features preceding and following an annotation, as well as those overlapping it. The features to include when looking for the closest feature can be provided as a comma separated list using the -f/--features option. This mode adds columns to the annotation file for the closest feature type, name, strand and distance from the annotation, in each direction and within.
The alternative method, selected with the -o/--overlap option, detects only features overlapping an annotation, although this region can be extended to include neighbouring features with the -m/--margin option. This mode adds columns for overlapping feature types and feature names to the annotation file.
The output can be configured to return all results, using the -a/--all option, or just those with the highest priority. By default only transcripts and exons are included. See the comments in the libAnnoFeat.py file for instructions on how to modify the priority list.
This utility can compare any two supported annotation files but is most likely to be used to compare the results of an analysis against a genome annotation file.
NOTE: This utility will run significantly slower with files which are not sorted by genomic position.

**Arguments:**
Query filename (required, filepath)
Reference filename (required, filepath)
Output filename (required, filepath)
-m/--margin: Additional margin (in bps) to include in the search for overlapping features.
-a/--all: Returns all overlapping features as a semicolon separated list.
-t/--title: Title to use for the column header (if a header line is present).
-m/--margin: Additional margin (in bps) to include in the search for overlapping features (use with -o).
-a/--all: Returns all overlapping features as a semicolon separated list.
-t/--title: Title to use for the column header, if a header line is present (use with -o).
-o/--overlap: Only returns those results which overlap with an annotation.
-s/--sense: Return the results formatted using sense/antisense direction.
-f/--features: Features to include when looking for the closest feature.

**Example:**
Searching for overlapping features with itself, an unrealistic use case.
Searching a file against itself for overlapping features, with a 10 base pair margin (an unrealistic use case).

python3 annofeat.py test/sampledata/GTF.gtf test/sampledata/GTF.gtf output.gtf -o -a -t OverlappingPlus10 -m 10

Searching for closest transcripts, exons and UTR features using a hypothetical genome annotation file.

python3 annofeat.py test/sampledata/GTF.gtf genefeatures.gtf output.gtf -a -f transcript,exon,UTR

python3 annofeat.py test/sampledata/GTF.gtf test/sampledata/GTF.gtf output.gtf -a -t OverlappingPlus10 -m 10
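
To make the effect of the margin concrete, the sketch below tests whether a reference feature overlaps an annotation once the annotation is widened by a margin on both sides. It is an illustration only: the function names and tuple layouts are hypothetical, coordinates are assumed to be inclusive as in GTF, and libAnnoFeat.py may implement the search differently.

```python
def overlaps(query_start, query_end, ref_start, ref_end, margin=0):
    """True if the reference interval touches the query widened by margin bp."""
    return ref_start <= query_end + margin and ref_end >= query_start - margin

def overlapping_features(query, references, margin=0):
    """Return the references on the same chromosome that overlap the query.

    query is (chrom, start, end); each reference is
    (chrom, start, end, feature_type, name).
    """
    chrom, start, end = query
    return [ref for ref in references
            if ref[0] == chrom and overlaps(start, end, ref[1], ref[2], margin)]
```

With a margin of 0 only true overlaps are returned. A margin of 10 also picks up features that begin or end within 10 bp of the annotation.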
## Annotation atlas (annoatlas.py)

This utility adds columns from a Human Protein Atlas dataset to an annotation file. If the annotation file format does not include gene names as standard, the column containing them can be set manually using the -c/--column option, with columns numbered from 0. The columns to add from the Human Protein Atlas dataset are set using the -a/--atlascols option, which accepts a comma separated list of column numbers or column names, or of regular expression search terms when combined with the -r/--regex option.
NOTE: Only the TSV formatted complete dataset is compatible. The gene name, synonym and Ensembl ID columns are used for detecting a match with an annotation file entry.

**Arguments:**
Input filename (required, filepath)
Atlas filename (required, filepath)
Output filename (required, filepath)
-c/--column: Annotation file column containing gene names.
-r/--regex: Use regular expression search terms for atlas columns.
-a/--atlascols: Human Protein Atlas columns to add to the annotation file.

**Example:**
Add specific columns to an annotation file.

python3 annoatlas.py test/sampledata/atlastest.bed proteinatlas.tsv combinedoutput.tsv -c 6 -a 6,7,Chromosome

Add columns containing the word RNA, using a regular expression search term.

python3 annoatlas.py test/sampledata/atlastest.bed proteinatlas.tsv combinedoutput.tsv -c 6 -a RNA -r
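
The column selection described for -a/--atlascols can be pictured as resolving each term against the atlas header line. The sketch below is one possible reading, not the code in annoatlas.py: `select_columns` is a hypothetical helper, and it assumes that in regular expression mode every term is treated as a pattern, while otherwise a purely numeric term is a column index and anything else is an exact column name.

```python
import re

def select_columns(header, selectors, use_regex=False):
    """Resolve a comma separated selector string to header column indices."""
    chosen = []
    for term in selectors.split(","):
        term = term.strip()
        if use_regex:
            # Assumed behaviour: every term is a search pattern.
            matches = [i for i, name in enumerate(header) if re.search(term, name)]
        elif term.isdigit():
            matches = [int(term)] if int(term) < len(header) else []
        else:
            matches = [i for i, name in enumerate(header) if name == term]
        for index in matches:
            if index not in chosen:  # keep first-seen order, no duplicates
                chosen.append(index)
    return chosen
```

The header used with this helper would be the tab-split first line of the atlas TSV file.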

## Annotation viewer (annoview.py)

@@ -133,6 +183,38 @@ query r/l Moves the query view to the right or left (if lines extend beyond the
query unedited Shows the query annotations unedited
query edited Shows the query annotations in a standardised format

## Example project
This section includes a short example project using these tools to identify how many annotations have a potentially intact promoter and follow a prognostic protein.
Required files:
- Human Protein Atlas as proteinatlas.tsv
- NCBI Reference Sequence annotation as hg38.gtf
- Annotation file as DFAM.tsv

First filter the annotation file to find features with an intact promoter.
- python3 annofilter.py -i DFAM.tsv -o DFAMfiltered.tsv --minpromoter 100 --defaultend 1000

Convert the file to a GTF file (not needed, but here as an example)
- python3 annoconv.py -i DFAMfiltered.tsv -o GTFfiltered.gtf -c GTF

To speed up comparisons between feature files, ensure the NCBI reference annotation is sorted by genomic position.
- python3 annosort.py hg38.gtf -o hg38sorted.gtf

Convert the annotation file to a standard tsv format.
- python3 annotabify.py GTFfiltered.gtf GTFtabulated.gtf

Add information on the closest feature
- python3 annofeat.py GTFtabulated.gtf hg38sorted.gtf GTFfeatures.gtf -s

Add columns from the Human Protein Atlas dataset
- python3 annoatlas.py GTFfeatures.gtf proteinatlas.tsv GTFatlas.gtf -c 13 -a Pathology -r

Convert back to a standard GTF formatted file
- python3 annotabify.py GTFatlas.gtf GTFdetabulated.gtf -r


At the end of these commands the GTF file should contain information on the closest features and relevant prognostic information. This can then be sorted or filtered in a spreadsheet software package, using the GTFatlas.gtf file as a TSV file type, or fed into further analysis software.
NOTE: When adding columns to annotation files it is important to preserve the header and file extensions if the file is to be used by another utility, so the software recognises which columns contain essential information.

## Useful resources
This section contains links to some useful resources applicable to both using the utilities and on where to find suitable annotation files.

@@ -142,6 +224,7 @@ This section contains links to some useful resources applicable to both using th
- GTF formatted genome annotation from [UCSC](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/)
- GTF formatted genome annotation from [NCBI](https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/)
- UCSC repeat annotation files from [UCSC's table browser website](https://genome.ucsc.edu/cgi-bin/hgTables)
- The Human Protein Atlas dataset from [The Human Protein Atlas website](https://www.proteinatlas.org/about/download)

**UCSC Table browser settings:**
Open Reading Frame (ORF) positions:
4 changes: 2 additions & 2 deletions lib/libAnnoFeat.py
@@ -254,11 +254,11 @@ def featureClosestAddColumn(annofileobj, reftrackobj, outfileobj, senseorder=0,
while line[0]=="#":
header = line
line = annofileobj.readline()
extracolstmp = "\t{} Name\t{} Type\t{} Strand\t{} Distance\tWithin Name\tWithin Strand\tWithin Type\tWithin Distance\t{} Name\t{} Type\t{} Strand\t{} Distance\n"
extracolstmp = "\t{} Name\t{} Type\t{} Strand\t{} Distance\tWithin Name\tWithin Type\tWithin Strand\tWithin Distance\t{} Name\t{} Type\t{} Strand\t{} Distance\n"
if senseorder:
extracols = extracolstmp.format("AntiSense","AntiSense","AntiSense","AntiSense","Sense","Sense","Sense","Sense")
else:
extracols = extracolstmp.format("Preceeding","Preceeding","Preceeding","Preceeding","Following","Following","Following","Following")
extracols = extracolstmp.format("Preceding","Preceding","Preceding","Preceding","Following","Following","Following","Following")
newheader = header.strip() + extracols
outfileobj.write(newheader) # NOTE: If multiple preceding comment lines are present they will not be preserved using this method
# For each data line open the annotation position and perform comparison
6 changes: 5 additions & 1 deletion lib/libAnnoTabify.py
@@ -9,6 +9,7 @@
standardgtfheader = "#seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tattributes"
unknownkey = "unknown"
divider = " " # GTF Key/Value separator character. Some software outputs '=' but ' ' is standard.
detabspacechar = "_" # When using the detab function replace additional spaces with this char

# Principally designed following the GTF specification. (https://mblab.wustl.edu/GTF22.html)
# But could be adapted to support other file formats in the future.
@@ -76,7 +77,10 @@ def deTabify(infileobj, outfileobj, addheader=0):
if header:
header = header.strip() # Remove any preceding/following tabs
if len(header.split('\t'))>8:
headings = header.split('\t')[8:]
headingsUnmod = header.split('\t')[8:]
headings = []
for item in headingsUnmod: # Replace any additional space characters
headings.append(item.replace(" ",detabspacechar))
# Write the header
if addheader>1:
newheader = standardgtfheader+'\n'
10 changes: 10 additions & 0 deletions test/sampledata/atlasInput.bed
@@ -0,0 +1,10 @@
chr1 50619126 50619902 L1P4a_5end 777 - FAF1
chr1 50620657 50621383 L1P4a_5end 604 - UNKNOWN
chr1 50621700 50621883 L1P4a_5end 117 - .
chr1 50622194 50622527 L1P4a_5end 260 - DPM1
chr1 55679656 55680125 L1P4a_5end 391 - PACE-1
chr1 56343575 56344988 L1P4a_5end 1082 + ENSG00000000938
chr1 57682374 57684857 L1P4a_5end 2334 + LAS1L
chr1 68576796 68577787 L1P4a_5end 636 + FLJ12525
chr1 70625362 70625721 L1P4a_5end 301 + ENSG00000005243
chr1 70743911 70745541 L1P4a_5end 1314 - FTDCR1B
