Breed4Food data processing & analysis
- Download the Pig QTLdb data in GFF3 format. This data set is part of a larger collection of trait mappings in animals called Animal QTLdb. For a detailed description of the data set (metadata), visit this page.
Example lines in this GFF file:
Chr.1 Animal QTLdb Meat_and_Carcass_Association 155977038 156168386 . . . QTL_ID=10628;Name=Backfat at tenth rib;Abbrev=10RIBBFT;PUBMED_ID=12807784;trait_ID=7;trait=Backfat at tenth rib;FlankMarkers;VTO_name=subcutaneous adipose thickness;CMO_name=back fat thickness, 10th rib;Map_Type=Linkage;Model=Mendelian;peak_cM=76;Significance=Significant;P-value=<0.0001;Additive_Effect=0.06;gene_ID=397359;gene_IDsrc=NCBIgene
Chr.1 Animal QTLdb Reproduction_Association 21576258 21576298 . . . QTL_ID=31297;Name=Teat number;Abbrev=TNUM;PUBMED_ID=25104247;trait_ID=167;trait=Teat number;breed=large white,pietrain,Landrace;FlankMarkers=rs80866387;VTO_name=nipple quantity;CMO_name=teat number;Map_Type=Genome;Model=Mendelian;Test_Base=-;peak_cM=10.40;Significance=Significant;P-value=<0.05;gene_ID=100625610;gene_IDsrc=NCBIgene
See the corresponding QTL entry in the pigQTLdb. Note: The QTL symbol 10RIBBFT in this entry corresponds to the gene symbol 10RIBBFT in the NCBI Gene database (Gene ID: 100529885); however, the GFF line above does not reference this gene but rather Gene ID: 397359, which refers to MC4R, a candidate gene for this QTL entry. So where does the 10RIBBFT gene entry come from (e.g. user input)?
The attributes field (9th column) of the GFF file contains cross-references to, among others, the Livestock Product Trait Ontology (LPT, issue #15), the Vertebrate Trait Ontology (VT, issue #16) and the Clinical Measurement Ontology (CMO, issue #17). Unfortunately, the LPT/VT/CMO IDs (part of the URIs) are not provided alongside the human-readable names, so one needs to add/reconcile them. Most of the GFF lines reference pig breed(s) as a comma-separated list, which can be mapped to the Livestock Breed Ontology (LBO).
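As a minimal sketch of how the attributes field could be unpacked before reconciling names against ontology IDs, the Python snippet below splits the field on `;` and `=`. The function name `parse_gff_attributes` is hypothetical, and the example string is an excerpt of the second example line above; bare keys without a value (such as `FlankMarkers` in the first example line) are assumed to map to an empty string.

```python
def parse_gff_attributes(field):
    """Split a GFF attributes field (9th column) into a dict.

    Keys without '=' (e.g. a bare 'FlankMarkers') map to an empty string.
    Only the first '=' separates key and value, so 'P-value=<0.05' is safe.
    """
    attrs = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value
    return attrs


# Excerpt of the attributes of QTL_ID=31297 from the example lines.
example = (
    "QTL_ID=31297;Name=Teat number;Abbrev=TNUM;PUBMED_ID=25104247;"
    "trait_ID=167;trait=Teat number;breed=large white,pietrain,Landrace;"
    "FlankMarkers=rs80866387;VTO_name=nipple quantity;CMO_name=teat number"
)

attrs = parse_gff_attributes(example)

# The breed attribute is a comma-separated list of breed names.
breeds = [b.strip() for b in attrs.get("breed", "").split(",") if b.strip()]

print(attrs["VTO_name"], "|", attrs["CMO_name"], "|", breeds)
```

The resulting names (`VTO_name`, `CMO_name`, `breed`) are the human-readable labels that still need to be matched to VT, CMO and LBO identifiers in a separate step.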
Note: When using the Animal QTLdb API the GFF line returned for this QTL differs from the GFF file (issue #8):
curl "www.animalgenome.org/cgi-bin/QTLdb/API/ifetch?q=10628&s=pig&m=GFF"
... Chr1 Animal QTLdb Association 155977038 156168386 . qtlID=10628;name=Backfat at tenth rib;symbl=10RIBBFT;linkageMap=(76);flankMarkers=,,,,;pubmedID=12807784;traitClass=Meat_and_Carcass;vt=VT:1000135;lpt=;cmo=CMO:0000433 ...
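To make the difference between the two representations concrete, the following Python sketch pairs attribute keys from the file and from the API response whenever they carry the same value. It only uses values shown in the two lines above (the file attributes are an excerpt); the helper name `parse` is hypothetical, and matching by identical value is an assumption that will miss fields that are formatted differently, such as `peak_cM=76` versus `linkageMap=(76)`.

```python
def parse(field):
    """Parse 'key=value;key=value' into a dict (bare keys map to '')."""
    attrs = {}
    for pair in field.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


# Excerpt of the attributes of QTL 10628 as found in the GFF file.
file_attrs = parse(
    "QTL_ID=10628;Name=Backfat at tenth rib;Abbrev=10RIBBFT;"
    "PUBMED_ID=12807784;trait=Backfat at tenth rib;peak_cM=76"
)

# Attributes of the same QTL as returned by the API.
api_attrs = parse(
    "qtlID=10628;name=Backfat at tenth rib;symbl=10RIBBFT;linkageMap=(76);"
    "flankMarkers=,,,,;pubmedID=12807784;traitClass=Meat_and_Carcass;"
    "vt=VT:1000135;lpt=;cmo=CMO:0000433"
)

# Pair keys whose (non-empty) values are identical in both sources.
matches = {
    (file_key, api_key)
    for file_key, file_value in file_attrs.items()
    for api_key, api_value in api_attrs.items()
    if file_value and file_value == api_value
}

for file_key, api_key in sorted(matches):
    print(f"{file_key} <-> {api_key}")
```

Note that the API response carries the VT and CMO identifiers (`vt`, `cmo`) that the GFF file lacks, while `lpt` is empty for this QTL.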
There are 289 QTLs associated with the terms teat number and/or nipple quantity; they are found on 19 out of 20 chromosomes:
#N #chr
56 7
51 8
23 1
20 12
19 4
18 3
17 6
14 2
11 16
11 15
11 10
10 5
6 11
5 17
5 14
4 9
4 18
3 13
1 X
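A per-chromosome tally like the one above could be produced roughly as follows. This is a sketch under assumptions: the function name `count_by_chromosome` is hypothetical, columns are assumed to be tab-separated as in standard GFF, the `Chr.` prefix is assumed to be stripped for display, and the two sample lines are shortened versions of the example lines shown earlier.

```python
from collections import Counter


def count_by_chromosome(lines, terms=("teat number", "nipple quantity")):
    """Count GFF lines per chromosome whose attributes mention any term."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/pragma lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF record
        attributes = cols[8].lower()  # case-insensitive match on trait names
        if any(term in attributes for term in terms):
            chrom = cols[0]
            if chrom.startswith("Chr."):
                chrom = chrom[len("Chr."):]
            counts[chrom] += 1
    return counts


sample = [
    "Chr.1\tAnimal QTLdb\tReproduction_Association\t21576258\t21576298\t.\t.\t.\t"
    "QTL_ID=31297;trait=Teat number;VTO_name=nipple quantity;CMO_name=teat number",
    "Chr.1\tAnimal QTLdb\tMeat_and_Carcass_Association\t155977038\t156168386\t.\t.\t.\t"
    "QTL_ID=10628;trait=Backfat at tenth rib",
]

counts = count_by_chromosome(sample)
for chrom, n in counts.most_common():
    print(n, chrom)
```

For the real data, the same function can be fed a file handle opened with `gzip.open(path, "rt")` from the standard library.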
The trait is referred to in 22 publications (see PUBMED_IDs in the attributes field of the GFF file).
zcat qdwnld58435SMJG.txt.gz|grep nipple|perl -lne 'print "[PMID:$1](http://www.ncbi.nlm.nih.gov/pubmed/$1)" if /PUBMED_ID=(\d+)/'|sort -u
PMID:11048919 PMID:11167524 PMID:11263822 PMID:11583418 PMID:12606397 PMID:15537760 PMID:16100052 PMID:16293122 PMID:17032781 PMID:17498628 PMID:18219525 PMID:18651874 PMID:19226448 PMID:21108822 PMID:22443659 PMID:24456574 PMID:24981054* PMID:25104247 PMID:25158056 PMID:25178368 PMID:26202474 PMID:26830357
Note: PMID:24981054 includes supplementary tables as MS Excel files: a list of genes in QTLs (Ensembl gene IDs) and a list of QTL studies (N=19 instead of 22!).
Note: The 3rd column uses non-standard feature types (keys); see the DDBJ/ENA/GenBank Feature Table Definition for reference. The types and their counts are:
321 Meat_and_Carcass_QTL
96 Meat_and_Carcass_eQTL
73 Reproduction_QTL *
60 Meat_and_Carcass_Association
51 Production_QTL
46 Health_QTL
38 Exterior_QTL
21 Health_Association
15 Exterior_Association
14 Production_Association
9 Reproduction_Association *
These features could be mapped to the misc_feature FT key. Feature types marked with '*' are associated with the trait of interest.
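One possible way to perform such a mapping without losing information is sketched below in Python. It rewrites the 3rd column to `misc_feature` and keeps the original type in two extra attributes. The function name `to_misc_feature` and the attribute names `trait_class` and `qtl_type` are hypothetical, and splitting the type at its last underscore is an assumption based on the type names listed above.

```python
def to_misc_feature(line):
    """Replace a non-standard feature type with 'misc_feature'.

    The original type, e.g. 'Meat_and_Carcass_eQTL', is split at its last
    underscore into a trait class ('Meat_and_Carcass') and a QTL type
    ('eQTL'); both are appended to the attributes column so that nothing
    is lost. The attribute names are illustrative only.
    """
    cols = line.rstrip("\n").split("\t")
    trait_class, _, qtl_type = cols[2].rpartition("_")
    cols[2] = "misc_feature"
    cols[8] = f"{cols[8]};trait_class={trait_class};qtl_type={qtl_type}"
    return "\t".join(cols)


record = (
    "Chr.1\tAnimal QTLdb\tReproduction_Association\t21576258\t21576298"
    "\t.\t.\t.\tQTL_ID=31297;Name=Teat number"
)
converted = to_misc_feature(record)
print(converted)
```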
- Data cleaning & transformation into an RDF graph using OpenRefine+RDF extension and/or the Virtuoso R2RML processor.
The following sections are nice to know but not used further in the Virtuoso-based platform (deprecated).
- Download pig (Sus scrofa, assembly Sscrofa10.2) genome annotation in GFF from the NCBI's Genome database. There are two different download URLs for the GFFs. Alternatively, use genome annotations directly in RDF distribution from Ensembl. Note: How similar/different are these data sets?
Example lines in the GFF file [1,2]:
NC_010443.4 RefSeq region 1 315321322 . + . ID=id0;Dbxref=taxon:9823;Name=1;breed=mixed;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_010443.4 Gnomon gene 13708 21219 . - . ID=gene0;Dbxref=GeneID:100620291;Name=LOC100620291;gbkey=Gene;gene=LOC100620291;gene_biotype=protein_coding;partial=true;start_range=.,13708
- Transform this GFF file into an RDF graph.
Feature types/counts in the GFF files ('*' indicates a non-standard feature key; '!' indicates counts that differ between the two files):
zcat GCF_000003025.5_Sscrofa10.2_genomic.gff.gz|grep -v '#'|cut -f 3|sort|uniq -c|sort -nr
564049 exon
456586 CDS
47405 mRNA
43266 cDNA_match (*)
39488 gene
16715 ncRNA
4583 region (*)
2044 transcript (*, a more specific term?)
501 tRNA
343 primary_transcript
290 match (*)
91 V_gene_segment (*, V_region?)
9 C_gene_segment (*, C_region?)
4 rRNA
1 repeat_region
1 D_loop (*, D-loop)
In total 1175376 features.
zcat ref_Sscrofa10.2_top_level.gff3.gz|grep -v '#'|cut -f 3|sort|uniq -c|sort -nr
563942 exon (!)
456477 CDS (!)
47405 mRNA
43266 cDNA_match
39488 gene
16715 ncRNA
4583 region
2044 transcript
501 tRNA
343 primary_transcript
290 match
91 V_gene_segment
9 C_gene_segment
4 rRNA
1 repeat_region
1 D_loop
In total 1175160 features.
Note: The GFF files differ in the total number of features (1175376 vs. 1175160), which comes down to the exon and CDS counts. This raises the question of which one to use (both NCBI genome build: Sscrofa10.2 and accession: GCF_000003025.5, annotation release ?/105). Moreover, neither file contains the intron, five_prime_UTR or three_prime_UTR feature types.
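The comparison of the two files can also be scripted. The Python sketch below counts feature types (3rd column) and reports only the types whose counts differ. The function names `feature_counts` and `diff_counts` are hypothetical; the numbers in the example are a subset of the counts listed above. Unlike `grep -v '#'`, which drops every line containing a `#`, this version skips only lines that start with `#`, so its totals could in principle deviate from the shell output.

```python
from collections import Counter


def feature_counts(lines):
    """Count feature types (3rd column) in tab-separated GFF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # comment or pragma line
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


def diff_counts(a, b):
    """Return {feature_type: (count_a, count_b)} for types that differ."""
    return {
        key: (a.get(key, 0), b.get(key, 0))
        for key in sorted(set(a) | set(b))
        if a.get(key, 0) != b.get(key, 0)
    }


# Subset of the counts reported above for the two NCBI GFF files.
genomic = Counter({"exon": 564049, "CDS": 456586, "mRNA": 47405})
top_level = Counter({"exon": 563942, "CDS": 456477, "mRNA": 47405})

differences = diff_counts(genomic, top_level)
for feature, (n_genomic, n_top_level) in differences.items():
    print(feature, n_genomic, n_top_level, n_genomic - n_top_level)
```

In practice, each `Counter` would come from `feature_counts(gzip.open(path, "rt"))` applied to the respective download.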
- Download the pig genome annotations in RDF from Ensembl (release 86).
Note: The numbers of SO feature types differ significantly from the GFF files above.
209609 exon
25511 mRNA
21607 protein_coding_gene
1092 snRNA
1092 snRNA_gene
958 pseudogene
877 miRNA
877 miRNA_gene
640 snoRNA_gene
640 snoRNA
618 processed_transcript
474 aberrant_processed_transcript
347 NMD_transcript_variant
229 processed_pseudogene
185 RNA
171 rRNA_gene
171 rRNA
47 lincRNA
35 lincRNA_gene
22 mt_gene
2 pseudogenic_transcript
2 C_gene_segment