diff --git a/user_utils/normalize/README.md b/user_utils/normalize/README.md
index 1d1b9ab..aa9cded 100644
--- a/user_utils/normalize/README.md
+++ b/user_utils/normalize/README.md
@@ -69,3 +69,62 @@ perl compare_clusters.pl -m -o intersection -d \
 your_sequences_homologues/clusters_0taxa_algOMCL_e0_raw,your_sequences_homologues/clusters_0taxa_algOMCL_e0_norm \
 &> log.intersection
 ```
+
+## Effect on the clustering
+
+__1) Protein datasets__
+
+Sequence clusters produced by the standard and the normalized version of GET_HOMOLOGUES-EST
+on the peptides of 119 isolates of the bacterial genus Stenotrophomonas showed some differences.
+There were 23,249 identical clusters in both runs, but 1,190 and 1,359 clusters unique
+to the standard and normalized setup, respectively. The number of clusters containing a single sequence
+(singletons) was greater (329 vs 225) and more evenly distributed across peptide length after the normalization step (Figure 1).
+
+![singleton_dist_prot](images/singleton_len_prot.png)
+
+*Figure 1. Length distribution of singletons in the original and normalized clustering. Singleton
+sequence length after normalization is more evenly distributed across peptide length.*
+
+Some sequences originally found in clusters were classified as singletons after normalization.
+Subtracting those sequences did not have any effect on the overall percentage of sequence identity of the clusters,
+which suggests they may be misclustered after normalization. However, some outliers among long sequences
+increase cluster mean identity when moved to singleton clusters, which suggests that the normalization process is useful
+for building high-quality clusters of long sequences to be used in phylogenetic analyses (Figure 2).
+
+![diff_identity](images/effect_prot_id.png)
+
+*Figure 2. Difference in cluster % sequence identity before and after removing sequences because of
+the normalization process in different length regions. Positive values indicate an increase in identity
+after a sequence is removed by normalization. Length was measured as the mean alignment length
+reported by BLASTP for sequences within the cluster.*
+
+__2) Nucleotide datasets__
+
+The cluster sets produced by the standard and normalized versions of the
+GET_HOMOLOGUES-EST protocol were very different for the transcripts of 11 species of the genus Oryza.
+In particular, there were 111,964 identical gene clusters, and 14,779 and 38,524 unique clusters in the original
+and normalized results, respectively. Moreover, the number of singletons calculated by the standard
+program was 612, whereas after normalization it increased to 18,126. The singletons
+were, as in the protein dataset example, more evenly distributed across nucleotide length after the
+normalization step (Figure 3).
+
+![singleton_dist_nucl](images/singleton_len_nucl.png)
+
+*Figure 3. Length distribution of singletons in the original and normalized clustering. Singleton
+sequence length after normalization is more evenly distributed across nucleotide length.*
+
+The mean BLAST coverage values of the clusters usually increased after the subtraction of
+sequences because of the normalization. Manual inspection of some cases revealed that long sequences
+were subtracted from the original clusters and classified as singletons even when some regions
+aligned without mismatches with other sequences of the cluster (Figure 4). This effect of the normalization
+process might not be desired if users want to cluster CDS and transcripts together, even if they only
+share some particular regions, such as exons, but not other regions such as introns, only present
+in transcripts. In most cases, subtracting sequences because of normalization did not have an effect on
+the overall identity of the clusters.
+
+![effect_cov_nucl](images/effect_nucl_cov.png)
+
+*Figure 4. Coverage of the clusters before (x-axis) and after (y-axis) new singletons were subtracted
+from original clusters because of the normalization process.*
+
+
diff --git a/user_utils/normalize/images/effect_nucl_cov.png b/user_utils/normalize/images/effect_nucl_cov.png
new file mode 100644
index 0000000..8641724
Binary files /dev/null and b/user_utils/normalize/images/effect_nucl_cov.png differ
diff --git a/user_utils/normalize/images/effect_prot_id.png b/user_utils/normalize/images/effect_prot_id.png
new file mode 100755
index 0000000..09ff741
Binary files /dev/null and b/user_utils/normalize/images/effect_prot_id.png differ
diff --git a/user_utils/normalize/images/singleton_len_nucl.png b/user_utils/normalize/images/singleton_len_nucl.png
new file mode 100755
index 0000000..4151703
Binary files /dev/null and b/user_utils/normalize/images/singleton_len_nucl.png differ
diff --git a/user_utils/normalize/images/singleton_len_prot.png b/user_utils/normalize/images/singleton_len_prot.png
new file mode 100755
index 0000000..cba490d
Binary files /dev/null and b/user_utils/normalize/images/singleton_len_prot.png differ