AlgoReCell project
Binarization: QUALITATIVE INTERPRETATION OF BULK RNA-SEQ DATA. We previously acquired RNA sequencing (RNA-seq) data from populations of mouse ST2 cells at different stages of differentiation towards adipocytes and osteoblasts [PAPER REF; DATA REF]. Adipocyte differentiation was induced with a medium containing isobutylmethylxanthine, dexamethasone, and insulin for 2 days, followed by a change to a medium containing rosiglitazone and insulin (Sigma-Aldrich, I9278) until day 9. Osteoblast differentiation was induced with bone morphogenetic protein-4 until day 9. Each experiment was replicated 3 times. At days 0, 1, 3, 5, 9, and 15, a subpopulation of the cells was sequenced genome-wide.
In this project, we focused on the activity of transcription factors (TFs) at the different stages of ST2 differentiation. To enable the construction of qualitative models of the differentiation process, we applied an automated method that transforms the quantitative RNA-seq measurements of TF activity into qualitative assessments: active (1), inactive (0), or undetermined (intermediate). Two binarization methods were employed and compared: RefBool [PMID: 28334101], which binarizes with respect to background RNA-seq data collected across a range of different cell types; and a statistical analysis developed in the scope of this project, which binarizes with respect to the collected RNA-seq data. For this statistical analysis, representative background distributions of gene expression in 67 mouse tissues were generated from ARCHS4 data [https://maayanlab.cloud/archs4/, Kallisto raw read counts, retrieved 2019]. Independent vertex sampling was performed per tissue to remove correlated samples. Samples were further filtered for overall read counts between 10 and 100 million and a median >= 1 to remove unusual distributions and outliers.
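The sample filter described above can be illustrated with a minimal Python sketch. This is not the project's Matlab code. It assumes that the read-count bounds are inclusive and that the median is taken over the raw counts of all genes within a sample; the names keep_sample and filter_samples are hypothetical.

```python
from statistics import median

MIN_READS = 10_000_000    # lower bound on total reads per sample (assumed inclusive)
MAX_READS = 100_000_000   # upper bound on total reads per sample (assumed inclusive)
MIN_MEDIAN = 1            # minimum median raw count over all genes (assumed per sample)


def keep_sample(counts):
    """Return True if one sample's raw gene counts pass both filters."""
    total = sum(counts)
    return MIN_READS <= total <= MAX_READS and median(counts) >= MIN_MEDIAN


def filter_samples(samples):
    """Keep only the samples (name -> list of raw counts) that pass the filters."""
    return {name: counts for name, counts in samples.items() if keep_sample(counts)}
```

Samples with too few or too many reads, or with a median count below 1, are dropped before the background distributions are built.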
These background data were merged with our own data (after Kallisto alignment), quantile normalized, and converted to TPM (gene length normalization and TPM scaling). The gene-specific background distributions were then used to discretize our data as follows: expression below the median of the background -> 0; expression above the upper quartile of the background -> 1; values in between remain undetermined. In addition, genes with TPM < 1 were discretized to 0. Furthermore, genes with large expression differences across all samples and time points were identified via k-means clustering (2 clusters), requiring a minimum 3-fold change between the centroid locations, at least 3 data points per cluster, and passing a two-sample t-test (Matlab ttest2) between the two clusters. In total, 37 genes met these criteria; for them, the upper cluster was discretized to 1 and the lower cluster to 0. The Matlab code for this statistical analysis is available at: https://github.com/sysbiolux
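As an illustration of the rules above, the following Python sketch discretizes a single TPM value against a gene-specific background and checks the two-cluster criteria. It is a sketch under stated assumptions, not the Matlab implementation: the function names are hypothetical, the quartiles use the "inclusive" method of statistics.quantiles, the fold change is computed on the linear TPM scale, and the t-test step is omitted because the Python standard library provides no equivalent of ttest2.

```python
from statistics import quantiles


def discretize(tpm, background):
    """Map one TPM value to 0 (inactive), 1 (active), or None (undetermined)."""
    if tpm < 1:                      # very low expression is always inactive
        return 0
    _, med, q3 = quantiles(background, n=4, method="inclusive")
    if tpm < med:                    # below the background median
        return 0
    if tpm > q3:                     # above the background upper quartile
        return 1
    return None                      # between median and upper quartile


def two_means_1d(values, iters=100):
    """Plain 1-D k-means with k=2, initialised at the minimum and maximum."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        lower = [v for v in values if abs(v - lo) <= abs(v - hi)]
        upper = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not upper:                # all values identical: no second cluster
            break
        new_lo, new_hi = sum(lower) / len(lower), sum(upper) / len(upper)
        if (new_lo, new_hi) == (lo, hi):
            break                    # centroids stopped moving
        lo, hi = new_lo, new_hi
    return lower, upper, lo, hi


def has_two_states(values, min_fold=3, min_size=3):
    """Check the fold-change and cluster-size criteria (t-test not included)."""
    lower, upper, lo, hi = two_means_1d(values)
    return len(lower) >= min_size and len(upper) >= min_size and hi >= min_fold * lo
```

For a gene that passes has_two_states (and, in the original analysis, the t-test), the values in the upper cluster would be set to 1 and those in the lower cluster to 0, overriding the threshold-based result.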
METACORE® database for TF interactions: In 2019, we extracted from the METACORE database all known interactions (Transcription regulation, Influence on expression & Binding) between TFs [REF-MerjaPaper] and 7 known marker genes that show clear expression changes between adipocytes and osteoblasts in our own expression data: 4 highly expressed in adipocytes (ADIPOQ, FABP4, CEBPA, LPL) and 3 highly expressed in osteoblasts (ALPL, HEY1, SP7). Only measured nodes and their interactions were kept. This resulted in a network of 1,027 nodes (almost exclusively TFs) with ~11,000 pairwise interactions.
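The restriction to measured nodes can be sketched as a simple edge filter. The Python snippet below is illustrative only: the function name filter_network and the (source, target, type) tuple layout are assumptions, and it does not reflect the METACORE export format.

```python
def filter_network(interactions, measured):
    """Keep interactions whose source and target were both measured.

    interactions: iterable of (source, target, interaction_type) tuples.
    measured: iterable of gene names present in the expression data.
    Returns the set of remaining nodes and the list of kept interactions.
    """
    measured = set(measured)
    kept = [
        (src, dst, kind)
        for src, dst, kind in interactions
        if src in measured and dst in measured
    ]
    nodes = {name for src, dst, _ in kept for name in (src, dst)}
    return nodes, kept
```

A node that was measured but has no remaining interaction does not appear in the returned node set; whether such isolated nodes were retained in the original network is not stated in the text.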