Skip to content

Computational methods for predicting Autism Spectrum Disorder from gene expression data

Notifications You must be signed in to change notification settings

thinng/asd_feat

Repository files navigation

Resources:

  • README.md: this file.
  • GSE18123GPL570.csv and GSE18123GPL6244.csv: two datasets
  • result.csv: result for each method with different number of features. Measured by accuracy, auc, f1, and mcc.

source codes:

  • conventional_feature_ranking.py: 12 conventional feature ranking methods plus AUC (baseline) and SFARI.
  • gene_interaction_co_expression.py: create co-expression networks.
  • gene_interaction_pathway.py: create pathway networks ('DO', 'GObp','GOcc', 'GOmf', and 'HPO' networks).
  • gene_interaction_PPI.py: create PPI networks.
  • gene_interaction_pubmed.py: create pubmed-embedding networks.
  • gene_embedding.py: create word, hence gene name included, embedding, used as input for gene_interaction_pubmed.py
  • geneRank.m: score genes based on their networks: co-expression, pathway, PPI, and pubmed-embedding networks
  • score_2_index.py: turn the score returned by geneRank.m into the index of genes in the datasets.
  • classify.py: running the classification task with the features ranked highest by the above methods, be it conventional or network-based.

folder pubmed/:

  • pubmed.txt (unzip pubmed.7z): contain PubMed raw text as input to learn word/gene embeddings.
  • gene_embedding.embed (unzip gene_embedding.7z): word/gene embeddings returned by running word2vec models on pubmed.txt.
  • gene_name_ref.txt: mapping vocabulary (gene names) between gene_embedding.embed and the datasets.

folder mapping/ (unzip mapping.7z):

folder gene_interaction/ (upzip all *.7z files into *.csv files):

  • store the output of gene_interaction_*.py, creating gene-gene networks.
  • gene_interaction/score/: store the output of geneRank.m, scoring genes by their networks.

folder feat_ranking/ (unzip feat_ranking.7z):

Gene ranking, either return by conventional_feature_ranking.py or score_2_index.py, which is the network-based feature ranking. This will be the input for classify.py, telling which features to be used in the classification.

Step-by-step running:

Please unzip all *.7z files required before running any algorithms. Please note that some may take hours to run.

0. Installing Python libaries needed for

  • conventional feature selections:
pip install skfeature-chappers
  • training embbeded vectors:
pip install gensim

1. Running conventional feature ranking:

python conventional_feature_ranking.py index

where index is in the range of the list of conventional methods ['chi_square', 'cmim', 'f_score', 'fisher_score', 'gini_index', 'icap', 'jmi', 'll_l21', 'ls_l21', 'reliefF', 'rfs', 'trace_ratio', 'SFARI', 'AUC'] The ranking of genes will be stored at the folder of feat_ranking/

2. Building gene-gene networks:

  • co_expression interactions:
python gene_interaction_co_expression.py
  • pathway interactions:
python gene_interaction_pathway.py

The mapping of gene-ontologies is stored at mapping/hgncTo*.csv. There are five ontologies: 'DO', 'GObp','GOcc', 'GOmf', and 'HPO'.

  • PPI networks:
python gene_interaction_PPI.py

The material for building PPI is stored at mapping/hippie_current.txt, downloaded from http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/hippie_current.txt

  • PubMed based networks: For learning gene-gene interactions from PubMed, first the embedding for genes should be learned by running:
python gene_embedding.py

Raw data extracted from PubMed is stored at /pubmed/pubmed.txt. The embbed model is stored at pubmed/gene_embedding.embed. Then the network is learned through:

python gene_interaction_pubmed.py
  • Running all these python programs result in gene-gene networks stored as gene_interaction/

3. Running network-based feature ranking for these networks by running:

geneRank.m

The score for genes returned by running geneRank.m is stored at /gene_interaction/score/. We convert the score to index for genes by running:

python score_2_index.py

The ranking of genes returned by running score_2_index.py will be stored at the folder of feat_ranking/

4. Now we have feature ranking for both conventional and network-based methods. Then the features with highest ranks will be used in the following classification:

python classify.py

The result for all these feature selections returned by running classify.py will be stored in result.csv.

About

Computational methods for predicting Autism Spectrum Disorder from gene expression data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published