Skip to content

Latest commit

 

History

History
109 lines (66 loc) · 4.3 KB

README.md

File metadata and controls

109 lines (66 loc) · 4.3 KB

Cancer Gene Expression by Machine Learning

1

This project aims to predict people who will survive breast cancer using machine learning models with the help of clinical data and gene expression profiles of the patients. Using machine learning models on gene data helps us better understand the disease and determine how well treatment methods work.

What is Breast Cancer?

Breast cancer is a type of cancer that occurs when some cells in the breast grow excessively. This type of cancer causes the greatest number of cancer-related deaths among women. There are 5 known types of breast cancer based on the occurrence of the tumor as you can see in the figure. These genes can be also passed from parent to child. Therefore, breast cancer is also a genetically transmitted cancer type.

2

Dataset

This dataset includes 31 clinical attributes, mRNA levels z-score for 331 genes, and mutation in 175 genes for 1904 breast cancer patients.

Importance of the Genes

BRCA1 and BRCA2: The most common cause of hereditary breast cancer is an inherited mutation in the BRCA1 or BRCA2 gene. In normal cells, these genes help make proteins that repair damaged DNA. Mutated versions of these genes can lead to abnormal cell growth, which can lead to cancer.

Other Genes: These gene mutations are much less common, and most of them do not increase the risk of breast cancer as much as the BRCA genes. Some of them are ATM, TP53, CHEK2, PTEN, CDH1, STK11 and PALB2.

Data Preprocessing

In the preprocessing part, first I cleaned the data and according to skew-distribution fill the missing values with the mode of them. After that, I applied Label Encoding and One-Hot Encoding to the categorical clinic data respectively.

Label Encoding:

The following categorical features are ordinal data encoded with the label encoding method.

  • cellularity
  • cancer_type_detailed
  • type_of_breast_surgery
  • her2_status_measured_by_snp6
  • pam50_+_claudin-low_subtype
  • her2_status_measured_by_snp6
  • tumor_other_histologic_subtype
  • integrative_cluster
  • 3-gene_classifier_subtype
  • death_from_cancer

One-Hot Encoding:

The following categorical features are nominal data encoded with the one-hot encoding method.

  • er_status_measured_by_ihc
  • er_status, her2_status
  • inferred_menopausal_state
  • primary_tumor_laterality
  • pr_status
  • oncotree_code

Data Visualization

In order to better understand the data and make some inferences, I analyzed using the following data visualization techniques.

Correlation Matrix

31

Treatment Types & Survivals

36

Survived Patients by Treatment Groups

43

Choosing Best PCA

PCA is an algorithm that reduces the dimension of datasets that have a high number of features. When dealing with the PCA, the scikit-learn library can be used to determine the number of components of the newly created data set. As you can see in the below plot, to get 95% of variance explained I need 375 principal components.

52

Accuracy Scores

Model Train Score Test Score
KNN 0.75 0.73
Logistic Regression 0.82 0.89
Random Forest 1.0 0.62
Decision Tree 0.69 0.62
Extra Trees 0.93 0.58
AdaBoost 1.0 0.65

63

References:

  1. https://www.nature.com/articles/s41523-018-0056-8
  2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5461908/