kwcurrin/semantic_similarity_draft

Project Title: Evaluation of semantic similarity and expect scores to detect signal from noise using a random decay approach

Author: Kevin Currin 

PI: Dr. Todd Vision

Post-Doctoral Mentor: Dr. Prashanti Manda

This repository contains two folders.
1. Scripts: This folder contains all of the scripts used in the experiment.
2. info: This folder contains files linked to each version of the code that describe the experimental design and results.

A brief description of each script used in this experiment is below. If a script is not listed, it is currently not used in the experiment.

Scripts:
1. parseRealProfiles.py
This script contains one function: parse_real_profiles.
This function uses the Python rdflib library to read the Turtle (.ttl) file in which the Phenoscape database profiles are stored (data/realprofilesold.ttl).
Only VTO profiles are extracted.
Currently, the granularity of terms is entity only.
The function creates two output files, the names of which are specified on the command line:
A. A file of all annotations across the selected VTO profiles (data/annotations.txt).
B. A two column tab-delimited file with profile ID as the first column and profile size (number of annotations) as the second.
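For illustration, the two outputs could be produced as below once the profiles have been extracted from the Turtle file. This is a minimal Python 3 sketch (the repository itself uses Python 2 idioms such as cPickle); the function name `write_profile_outputs` and the shape of the `profiles` dictionary are assumptions, not the script's actual internals.

```python
def write_profile_outputs(profiles, annotations_path, sizes_path):
    """Write (A) all annotations across profiles and (B) a two-column
    tab-delimited table of profile ID and profile size."""
    with open(annotations_path, "w") as ann_out, open(sizes_path, "w") as size_out:
        for profile_id, annotations in profiles.items():
            for annotation in annotations:
                ann_out.write(annotation + "\n")  # one annotation per line
            size_out.write("%s\t%d\n" % (profile_id, len(annotations)))
```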
2. makeRandomProfiles.py
This script contains three functions:
A. with_replacement (currently used): This function generates random profiles with the same number and profile size distribution as the original database profiles.
The random sampling of annotations for each individual profile is performed without replacement.
However, the random sampling between profiles is with replacement.
That is, annotations that are assigned to one profile are not removed from the annotation pool used to generate other random profiles.
This function both returns a dictionary of random profiles and uses cPickle to write the dictionary to an output file specified on the command line.
Currently, this function is run in stand-alone form to produce the file with the pickled dictionary, which reduces downstream computation time.
The file generated by this function that is used in downstream scripts is kure_mirror/scripts/random_profiles.txt.
B. without_replacement (currently not used or maintained): This function generates random profiles with the same number and size distribution as the original database profiles.
The random sampling of annotations for each individual profile is performed without replacement.
The random sampling between profiles is also without replacement.
That is, annotations assigned to one random profile are removed from the pool of annotations used to generate other random profiles.
This function only returns a dictionary of random profiles (not currently used).
C. read_files: This function reads the annotations.txt file into a list and the profile_sizes.txt file into a list of tuples for use by the other two functions in the script.
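The sampling scheme of with_replacement can be sketched as follows: `random.sample` guarantees no repeats within one profile, while drawing every profile from the same undepleted pool makes the sampling with replacement between profiles. This is a hedged Python 3 sketch rather than the script's actual code; the argument names are assumptions.

```python
import random

def with_replacement(annotation_pool, profile_sizes, seed=None):
    """Build random profiles matching the original size distribution.
    profile_sizes is a list of (profile_id, size) tuples, as produced
    by read_files."""
    rng = random.Random(seed)
    # random.sample draws without replacement within one profile;
    # the pool is never depleted, so sampling between profiles is
    # with replacement.
    return {profile_id: rng.sample(annotation_pool, size)
            for profile_id, size in profile_sizes}
```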
3. getQueryProfiles.py
This script contains one function, get_query_profiles, which selects a specified number of profiles of uniform size from a provided profile database.
The profile database, number of queries, and query size are specified in the function call.
A dictionary of the queries is returned.
This script is meant to be used as part of a pipeline rather than as a stand-alone script.
If the number of queries specified is greater than the total number of profiles of the specified size in the database, an error is raised.
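A minimal sketch of the selection logic, assuming the database is a dictionary mapping profile IDs to annotation lists (the names and signature are illustrative, not the script's actual code):

```python
import random

def get_query_profiles(database, num_queries, query_size, seed=None):
    """Select num_queries profiles that have exactly query_size annotations."""
    candidates = [pid for pid, anns in database.items()
                  if len(anns) == query_size]
    if num_queries > len(candidates):
        raise ValueError("only %d profiles of size %d in the database"
                         % (len(candidates), query_size))
    chosen = random.Random(seed).sample(candidates, num_queries)
    return {pid: database[pid] for pid in chosen}
```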
4. getRelationships.py
This script contains two functions:
A. get_relationships: This function currently generates a dictionary of ancestral relationships with children as keys and ancestors as values.
The ancestral relationships are determined from the super class file Subsumers_EAttr(Q)_OldRealProfiles.txt.
A pickled version of the dictionary is written to a file (ancestors.txt is the name of the file I generated with this script).
Thus, this function is meant to be run on its own to reduce computation time in the pipeline of the random decay experiment.
B. get_children: Note, this function is currently not used or maintained.
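A sketch of the child-to-ancestors dictionary that get_relationships builds. The exact format of the superclass file is not documented here, so this assumes one tab-separated (child, ancestor) pair per line; the function body is illustrative only.

```python
from collections import defaultdict

def get_relationships(lines):
    """Map each child term to the set of its ancestors, assuming each
    input line holds a tab-separated (child, ancestor) pair."""
    ancestors = defaultdict(set)
    for line in lines:
        child, ancestor = line.rstrip("\n").split("\t")
        ancestors[child].add(ancestor)
    return dict(ancestors)
```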
5. calcAllICs.py
This script contains two functions:
A. annotation_denominator: This function calculates the information content (IC) for all annotations using the frequency of an annotation in the entire annotation pool.
The annotation counts of children are added to their ancestors before the IC calculations are performed.
A dictionary of annotations as keys and their ICs as values is returned.
B. taxa_denominator: Note, this function is currently not used, but we plan to use it in the future.
Thus, this function is not currently maintained and will need to be updated when we start using it.
This function calculates IC using the total number of taxa as the denominator of the frequency calculation instead of the total number of annotations.
Our ultimate goal is to compare these two IC computation methods.
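The annotation-frequency IC described above can be sketched as follows. This is an illustrative Python 3 version under the stated scheme (child counts propagated to ancestors, total annotation count as the denominator); it is not the script's actual code.

```python
import math
from collections import Counter

def annotation_ic(profiles, ancestors):
    """IC(a) = -log(count(a) / total annotations), where count(a)
    includes the counts of a's descendants."""
    counts, total = Counter(), 0
    for annotations in profiles.values():
        for a in annotations:
            total += 1
            counts[a] += 1
            for anc in ancestors.get(a, ()):  # propagate child counts upward
                counts[anc] += 1
    return {a: -math.log(c / total) for a, c in counts.items()}
```

Under this scheme a term subsuming every annotation gets IC 0, and rarer terms get higher IC.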
6. makeMultipleLinearModel.py
This script contains two functions:
A. make_multi_lm: This function calculates the similarity score between each possible combination of profiles in a list of random queries and a random profile database.
The similarity score is the median of the asymmetric IC best pairs.
The log sizes of the query and database profiles for each comparison and the corresponding similarity score are written to a file for input into a multiple linear model script.
The function also contains code to run the multiple linear model and write the resulting fitted values, parameters, hat matrix, and mean squared error to a file in a pickled tuple (lm_results.txt).
Note, lm_results.txt is still used in the random decay experiment; the model-fitting code is commented out only because I wanted to generate a file with just the input values for the linear regression, in order to compare the linear regression results between R and Python.
B. make_random_queries: This function generates the random queries used in make_multi_lm.
The queries are generated by first selecting one profile size for every existing profile size in the random database and randomly assigning annotations to each profile.
The random assignment of annotations is with replacement within a profile and with replacement between profiles.
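The regression input that make_multi_lm writes out can be sketched as one row per query-database pair: log query size, log database profile size, and the similarity score. The similarity function is passed in here as a parameter for illustration; in the actual script it is the median of the asymmetric IC best pairs.

```python
import math

def regression_rows(queries, database, similarity):
    """One row per (query, database profile) pair, for input into a
    multiple linear model of similarity on the two log profile sizes."""
    rows = []
    for q_id, q_anns in queries.items():
        for d_id, d_anns in database.items():
            rows.append((math.log(len(q_anns)),
                         math.log(len(d_anns)),
                         similarity(q_anns, d_anns)))
    return rows
```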
7. calcMedianBestIC.py
This script contains two functions:
A. symmetric_comparison: Note, this function is currently not used.
For each profile in a database, this function calculates the median of the best IC pairs with respect to the query, the median of the best IC pairs with respect to the database profile, and the mean of these two medians.
The best match (highest mean) is returned.
This script is currently not maintained because we first wanted to use the asymmetric comparison method for our experiment.
However, our ultimate goal is to compare the symmetric and asymmetric comparison methods.
B. asymmetric_comparison: For each database profile, this function calculates the median of the best IC pairs with respect to the query terms only.
The best match (highest median) is kept.
An expect score is then calculated using this similarity score and the log sizes of the query and best match database profile.
A tuple containing the ID of the best database match, the corresponding similarity score, the list of the best pairs, and the expect score is returned.
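The asymmetric search can be sketched as below, with the term-to-term scoring function abstracted away as a `pair_score` parameter (an assumption for illustration; the actual script scores pairs by IC). The expect-score step is omitted because its formula is not given here.

```python
import statistics

def asymmetric_comparison(query, database, pair_score):
    """For each database profile, take each query term's best-scoring
    pair, then the median of those best scores; keep the profile with
    the highest median."""
    best_match = None
    for d_id, d_anns in database.items():
        best_pairs = [max((pair_score(q, d), q, d) for d in d_anns)
                      for q in query]  # best pair for each query term
        score = statistics.median(bp[0] for bp in best_pairs)
        if best_match is None or score > best_match[1]:
            best_match = (d_id, score, best_pairs)
    return best_match  # (best match ID, similarity score, best pairs)
```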
8. diluteQuery.py: This script contains one function, dilute_query.
This function takes a query profile and replaces the annotation at one of its indices with a random annotation.
The annotations originally present in the query are excluded from the pool of replacement annotations.
The function contains a flag to determine whether or not to also exclude any annotations that were used as replacements previously in the diluted query.
If this flag is set to exclude such annotations for every replacement in the query, then the diluted query will only contain unique annotations.
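A sketch of one dilution step, assuming the query is a list of annotations and the caller tracks previously used replacements in a set (the names and signature are illustrative, not the script's actual code):

```python
import random

def dilute_query(query, annotation_pool, used, exclude_previous=True, seed=None):
    """Replace the annotation at one random index with a random
    annotation not originally present in the query (and, optionally,
    not previously used as a replacement)."""
    rng = random.Random(seed)
    excluded = set(query) | (used if exclude_previous else set())
    replacement = rng.choice([a for a in annotation_pool if a not in excluded])
    diluted = list(query)
    diluted[rng.randrange(len(diluted))] = replacement  # overwrite one index
    used.add(replacement)
    return diluted
```

With exclude_previous set for every replacement, the diluted query contains only unique annotations, as described above.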
9. main.py: This script contains one function, main.
This function runs the main pipeline of the random decay experiment.
This function produces an output file containing a matrix of:
query ID, number of replacements, best profile match ID, similarity score, expect score, and the list of best pairs for the match.
