
Breed4Food data processing & analysis

Arnold Kuzniar edited this page Feb 20, 2017 · 2 revisions
  • Download the Pig QTLdb data in GFF3 format. This data set is part of Animal QTLdb, a larger collection of trait mappings in animals. For a detailed description of the data set (metadata), visit this page.

Example lines in this GFF file:

Chr.1	Animal QTLdb	Meat_and_Carcass_Association	155977038	156168386	.	.	.	QTL_ID=10628;Name=Backfat at tenth rib;Abbrev=10RIBBFT;PUBMED_ID=12807784;trait_ID=7;trait=Backfat at tenth rib;FlankMarkers;VTO_name=subcutaneous adipose thickness;CMO_name=back fat thickness, 10th rib;Map_Type=Linkage;Model=Mendelian;peak_cM=76;Significance=Significant;P-value=<0.0001;Additive_Effect=0.06;gene_ID=397359;gene_IDsrc=NCBIgene
Chr.1	Animal QTLdb	Reproduction_Association	21576258	21576298	.	.	.	QTL_ID=31297;Name=Teat number;Abbrev=TNUM;PUBMED_ID=25104247;trait_ID=167;trait=Teat number;breed=large white,pietrain,Landrace;FlankMarkers=rs80866387;VTO_name=nipple quantity;CMO_name=teat number;Map_Type=Genome;Model=Mendelian;Test_Base=-;peak_cM=10.40;Significance=Significant;P-value=<0.05;gene_ID=100625610;gene_IDsrc=NCBIgene

See the corresponding QTL entry in the pigQTLdb. Note: The QTL symbol 10RIBBFT in this entry corresponds to the gene symbol 10RIBBFT in NCBI's Gene database (Gene ID: 100529885); however, the first GFF line above does not reference this gene but rather Gene ID: 397359, which is the candidate gene MC4R for this QTL entry. So the question is where the 10RIBBFT gene entry comes from (e.g. user input).

The attributes field (9th column) of the GFF file contains cross-references to, among others, the Livestock Product Trait Ontology (LPT, issue #15), the Vertebrate Trait Ontology (VT, issue #16) and the Clinical Measurement Ontology (CMO, issue #17). Unfortunately, the LPT/VT/CMO IDs (part of the URIs) are not provided alongside the human-readable names, so one needs to add/reconcile them. Most of the GFF lines contain a reference to one or more pig breeds (a comma-separated list), which can be mapped to the Livestock Breed Ontology (LBO).
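As a minimal sketch of how the 9th column can be unpacked, the Python snippet below splits a GFF line into its nine columns and turns the attributes into a dictionary. The function names are made up for illustration, and the sample line is an abbreviated copy of the second example line above; keys without a value (such as the bare FlankMarkers in the first example line) are assumed to map to an empty string.

```python
def parse_attributes(field):
    """Parse a GFF3 attributes string into a dict; keys without '=' map to ''."""
    attrs = {}
    for item in field.strip().split(";"):
        if not item:
            continue
        # Split on the first '=' only, so values such as '<0.05' stay intact.
        key, _, value = item.partition("=")
        attrs[key] = value
    return attrs


def parse_gff_line(line):
    """Split one tab-separated GFF3 line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, _strand, _phase, attributes = cols
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "attributes": parse_attributes(attributes),
    }


# Abbreviated copy of the second example line (not the full attribute list).
line = "\t".join([
    "Chr.1", "Animal QTLdb", "Reproduction_Association",
    "21576258", "21576298", ".", ".", ".",
    "QTL_ID=31297;Name=Teat number;PUBMED_ID=25104247;"
    "breed=large white,pietrain,Landrace;FlankMarkers=rs80866387;"
    "VTO_name=nipple quantity;CMO_name=teat number;P-value=<0.05",
])

record = parse_gff_line(line)
# The breed attribute is a comma-separated list; drop empty entries.
breeds = [b for b in record["attributes"].get("breed", "").split(",") if b]
print(record["attributes"]["VTO_name"], breeds)
```

The ontology names (VTO_name, CMO_name) and the breed list obtained this way are the values that would still need to be reconciled with VT, CMO and LBO identifiers.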

Note: When using the Animal QTLdb API, the GFF line returned for this QTL (QTL_ID=10628) differs from the line in the GFF file (issue #8):

curl "www.animalgenome.org/cgi-bin/QTLdb/API/ifetch?q=10628&s=pig&m=GFF"

...
Chr1	Animal QTLdb	Association	155977038	156168386	.	qtlID=10628;name=Backfat at tenth rib;symbl=10RIBBFT;linkageMap=(76);flankMarkers=,,,,;pubmedID=12807784;traitClass=Meat_and_Carcass;vt=VT:1000135;lpt=;cmo=CMO:0000433
...
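The API line carries ontology IDs (vt, cmo) that the downloaded GFF file lacks, so one possible way to reconcile names with IDs is to join the two records on the QTL ID. The Python sketch below does this for QTL 10628 using the attribute strings shown on this page (the GFF attributes are abbreviated). It assumes that the API's vt/lpt/cmo values denote the same terms as the GFF's VTO_name/CMO_name values; the function names and the LPT_name key are hypothetical.

```python
def parse_attrs(field):
    """Turn 'key=value;key=value' into a dict (first '=' separates key and value)."""
    return dict(item.partition("=")[::2] for item in field.split(";") if item)


def reconcile(gff, api):
    """Pair human-readable ontology names (GFF file) with IDs (API) for one QTL."""
    if gff["QTL_ID"] != api["qtlID"]:
        raise ValueError("records describe different QTLs")
    return {
        "QTL_ID": gff["QTL_ID"],
        # Empty strings (e.g. 'lpt=') are normalised to None.
        "VT": (gff.get("VTO_name") or None, api.get("vt") or None),
        "LPT": (gff.get("LPT_name") or None, api.get("lpt") or None),  # LPT_name is a guess
        "CMO": (gff.get("CMO_name") or None, api.get("cmo") or None),
    }


# Attributes from the GFF file (abbreviated) and from the API response above.
gff_attrs = parse_attrs(
    "QTL_ID=10628;Name=Backfat at tenth rib;Abbrev=10RIBBFT;"
    "VTO_name=subcutaneous adipose thickness;"
    "CMO_name=back fat thickness, 10th rib"
)
api_attrs = parse_attrs(
    "qtlID=10628;name=Backfat at tenth rib;symbl=10RIBBFT;linkageMap=(76);"
    "flankMarkers=,,,,;pubmedID=12807784;traitClass=Meat_and_Carcass;"
    "vt=VT:1000135;lpt=;cmo=CMO:0000433"
)

merged = reconcile(gff_attrs, api_attrs)
for ontology in ("VT", "LPT", "CMO"):
    print(ontology, merged[ontology])
```

Whether the pairing is valid for every QTL would have to be checked against the ontologies themselves; the sketch only shows the join.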

There are 289 QTLs associated with the terms teat number and/or nipple quantity; they are found on 19 out of 20 chromosomes:

     #N #chr
     56 7
     51 8
     23 1
     20 12
     19 4
     18 3
     17 6
     14 2
     11 16
     11 15
     11 10
     10 5
      6 11
      5 17
      5 14
      4 9
      4 18
      3 13
      1 X
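A per-chromosome tally like the one above can be sketched in Python as follows. This is not necessarily how the table was produced: the function name, the case-insensitive matching and the stripping of the "Chr." prefix are assumptions, and the two sample lines are abbreviated copies of the example lines shown earlier.

```python
from collections import Counter

TERMS = ("teat number", "nipple quantity")


def count_by_chromosome(lines, terms=TERMS):
    """Count GFF lines whose attributes mention any of the terms, per chromosome."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue  # skip comments and malformed lines
        attributes = cols[8].lower()
        if any(term in attributes for term in terms):
            chrom = cols[0]
            if chrom.startswith("Chr."):
                chrom = chrom[len("Chr."):]
            counts[chrom] += 1
    return counts


# Abbreviated copies of the two example lines; only the second one matches.
sample = [
    "Chr.1\tAnimal QTLdb\tMeat_and_Carcass_Association\t155977038\t156168386"
    "\t.\t.\t.\tQTL_ID=10628;Name=Backfat at tenth rib;"
    "VTO_name=subcutaneous adipose thickness",
    "Chr.1\tAnimal QTLdb\tReproduction_Association\t21576258\t21576298"
    "\t.\t.\t.\tQTL_ID=31297;Name=Teat number;VTO_name=nipple quantity",
]

counts = count_by_chromosome(sample)
for chrom, n in counts.most_common():
    print(f"{n:7d} {chrom}")
```

For the full data set, the same function could be fed a handle opened with gzip.open("qdwnld58435SMJG.txt.gz", "rt").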

The trait is referred to in 22 publications (see PUBMED_IDs in the attributes field of the GFF file).

zcat qdwnld58435SMJG.txt.gz|grep nipple|perl -lne 'print "[PMID:$1](http://www.ncbi.nlm.nih.gov/pubmed/$1)" if /PUBMED_ID=(\d+)/'|sort -u

PMID:11048919 PMID:11167524 PMID:11263822 PMID:11583418 PMID:12606397 PMID:15537760 PMID:16100052 PMID:16293122 PMID:17032781 PMID:17498628 PMID:18219525 PMID:18651874 PMID:19226448 PMID:21108822 PMID:22443659 PMID:24456574 PMID:24981054* PMID:25104247 PMID:25158056 PMID:25178368 PMID:26202474 PMID:26830357

Note: PMID:24981054 (marked with '*') includes supplementary tables as MS Excel files: a list of genes in QTLs (Ensembl gene IDs) and a list of QTL studies (N=19 instead of 22!).

Note: The GFF file uses non-standard feature types (keys) in the 3rd column (see the DDBJ/ENA/GenBank Feature Table Definition for reference), namely:

 321 Meat_and_Carcass_QTL
  96 Meat_and_Carcass_eQTL
  73 Reproduction_QTL *
  60 Meat_and_Carcass_Association
  51 Production_QTL
  46 Health_QTL
  38 Exterior_QTL
  21 Health_Association
  15 Exterior_Association
  14 Production_Association
   9 Reproduction_Association *

These features could be mapped to the misc_feature FT key. Feature types marked with '*' are associated with the trait of interest.
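One possible way to do such a mapping without losing information is to split each non-standard key into its trait class and QTL type and keep both in a qualifier of the misc_feature. The Python sketch below assumes that the part after the last underscore is the QTL type; the function name and the layout of the note text are made up for illustration.

```python
def to_misc_feature(feature_type):
    """Map e.g. 'Meat_and_Carcass_eQTL' to a misc_feature with a descriptive note."""
    # Split at the LAST underscore: 'Meat_and_Carcass' + 'eQTL'.
    trait_class, sep, qtl_type = feature_type.rpartition("_")
    if not sep:
        raise ValueError(f"unexpected feature type: {feature_type!r}")
    return {
        "key": "misc_feature",
        "note": f"trait_class={trait_class}; qtl_type={qtl_type}",
    }


for ftype in ("Reproduction_QTL", "Reproduction_Association", "Meat_and_Carcass_eQTL"):
    print(ftype, "->", to_misc_feature(ftype))
```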

The following sections are nice to know but not used further in the Virtuoso-based platform (deprecated).

  • Download the pig (Sus scrofa, assembly Sscrofa10.2) genome annotation in GFF from NCBI's Genome database. There are two different download URLs for the GFF files. Alternatively, use the genome annotations directly in the RDF distribution from Ensembl. Note: How similar/different are these data sets?

Example lines in the GFF file [1,2]:

NC_010443.4	RefSeq	region	1	315321322	.	+	.	ID=id0;Dbxref=taxon:9823;Name=1;breed=mixed;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_010443.4	Gnomon	gene	13708	21219	.	-	.	ID=gene0;Dbxref=GeneID:100620291;Name=LOC100620291;gbkey=Gene;gene=LOC100620291;gene_biotype=protein_coding;partial=true;start_range=.,13708
  • Transform this GFF file into an RDF graph.
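To give an idea of what such a transformation could look like, the Python sketch below turns the example gene line above into a few N-Triples statements. It is not the conversion used in this project: the example.org namespaces and predicate names are placeholders, and only the Sequence Ontology class for gene (SO:0000704) is a real identifier.

```python
BASE = "http://example.org/pig/"        # hypothetical resource namespace
VOCAB = "http://example.org/vocab#"     # hypothetical predicate namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"
SO_GENE = "http://purl.obolibrary.org/obo/SO_0000704"  # Sequence Ontology: gene


def gene_to_ntriples(line):
    """Return N-Triples statements for a GFF 'gene' line (empty list otherwise)."""
    cols = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, _score, strand = cols[:7]
    if ftype != "gene":
        return []
    attrs = dict(item.partition("=")[::2] for item in cols[8].split(";") if item)
    subject = f"<{BASE}{attrs['ID']}>"
    return [
        f"{subject} <{RDF_TYPE}> <{SO_GENE}> .",
        f'{subject} <{VOCAB}seqid> "{seqid}" .',
        f'{subject} <{VOCAB}start> "{start}"^^<{XSD_INT}> .',
        f'{subject} <{VOCAB}end> "{end}"^^<{XSD_INT}> .',
        f'{subject} <{VOCAB}strand> "{strand}" .',
    ]


# The example gene line from the NCBI GFF file shown above.
gene_line = (
    "NC_010443.4\tGnomon\tgene\t13708\t21219\t.\t-\t.\t"
    "ID=gene0;Dbxref=GeneID:100620291;Name=LOC100620291;gbkey=Gene;"
    "gene=LOC100620291;gene_biotype=protein_coding;partial=true;"
    "start_range=.,13708"
)

triples = gene_to_ntriples(gene_line)
print("\n".join(triples))
```

In practice one would probably reuse an existing vocabulary for genomic locations instead of the placeholder predicates.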

Feature types/counts in the GFF files ('*' indicates non-standard feature key; '!' indicates diff. counts):

zcat GCF_000003025.5_Sscrofa10.2_genomic.gff.gz|grep -v '#'|cut -f 3|sort|uniq -c|sort -nr

564049 exon
456586 CDS
47405 mRNA
43266 cDNA_match (*)
39488 gene
16715 ncRNA
4583 region (*)
2044 transcript (*, a more specific term?)
 501 tRNA
 343 primary_transcript
 290 match (*)
  91 V_gene_segment (*, V_region?)
   9 C_gene_segment (*, C_region?)
   4 rRNA
   1 repeat_region
   1 D_loop (*, D-loop)

In total 1175376 features.

zcat ref_Sscrofa10.2_top_level.gff3.gz|grep -v '#'|cut -f 3|sort|uniq -c|sort -nr

563942 exon (!)
456477 CDS (!)
47405 mRNA
43266 cDNA_match
39488 gene
16715 ncRNA
4583 region
2044 transcript
 501 tRNA
 343 primary_transcript
 290 match
  91 V_gene_segment
   9 C_gene_segment
   4 rRNA
   1 repeat_region
   1 D_loop

In total 1175160 features.

Note: The GFF files differ in the total numbers of features. This raises the question of which one to use (both NCBI genome build: Sscrofa10.2 and accession: GCF_000003025.5, annotation release ?/105). Moreover, neither of these files contains intron or [five_prime|three_prime]_UTR feature types.
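The differences between the two count tables can be checked with a few lines of Python. The counts are copied from the listings above (the second file is identical except for exon and CDS); the variable names are arbitrary.

```python
# Feature counts from GCF_000003025.5_Sscrofa10.2_genomic.gff.gz (see above).
genomic = {
    "exon": 564049, "CDS": 456586, "mRNA": 47405, "cDNA_match": 43266,
    "gene": 39488, "ncRNA": 16715, "region": 4583, "transcript": 2044,
    "tRNA": 501, "primary_transcript": 343, "match": 290,
    "V_gene_segment": 91, "C_gene_segment": 9, "rRNA": 4,
    "repeat_region": 1, "D_loop": 1,
}
# ref_Sscrofa10.2_top_level.gff3.gz: same counts except for exon and CDS.
top_level = dict(genomic, exon=563942, CDS=456477)

# Keep only the feature types whose counts differ between the two files.
diffs = {
    ftype: genomic[ftype] - top_level[ftype]
    for ftype in genomic
    if genomic[ftype] != top_level[ftype]
}

total_genomic = sum(genomic.values())
total_top_level = sum(top_level.values())

print(diffs)                                  # {'exon': 107, 'CDS': 109}
print(total_genomic, total_top_level)         # 1175376 1175160
print(total_genomic - total_top_level)        # 216
```

So the whole difference of 216 features is accounted for by 107 exon and 109 CDS features.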

  • Download the pig genome annotations in RDF from Ensembl (release 86).

Note: The numbers of SO feature types differ significantly from the GFF files above.

209609	exon
25511	mRNA
21607	protein_coding_gene
1092	snRNA
1092	snRNA_gene
958	pseudogene
877	miRNA
877	miRNA_gene
640	snoRNA_gene
640	snoRNA
618	processed_transcript
474	aberrant_processed_transcript
347	NMD_transcript_variant
229	processed_pseudogene
185	RNA
171	rRNA_gene
171	rRNA
47	lincRNA
35	lincRNA_gene
22	mt_gene
2	pseudogenic_transcript
2	C_gene_segment