Merge pull request #208 from aying2/main
aying2 EC Assignments
JEFworks authored Mar 15, 2024
2 parents b9dd952 + 2daaf05 commit ea7d483
Showing 5 changed files with 275 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _posts/2024-01-26-aying2.md
@@ -14,7 +14,7 @@ I am visualizing quantitative data of the expression count of the VCAN gene for
## What data encodings are you using to visualize these data types?
I am using the geometric primitive of points to represent each spot on the spatial gene expression slide. To encode the spatial aligned x position, I am using the visual channel of position along the x axis. To encode the spatial aligned y position, I am using the visual channel of position along the y axis. To encode the quantitative expression count of the VCAN gene, I am using both the visual channel of saturation going from an unsaturated light grey to a saturated dark blue and the visual channel of area, going from smaller to larger size for increasing VCAN expression count.

- These two visual channels were chosen because according to the data type chart, area has a better than average resolving time for quantitative data and saturation has a slightly worse than average resolving time for quantitative data. It is believed that in conjunction, leveraging both visual channels will increase saliency of the quantitative VCAN expression count. Additionally, saturation has a better resolving time for categorical data than area so it may help the viewer identify areas of common VCAN expression more easily. Note that while VCAN expression count is the main quantitative variable and so encoding it with the x or y axis position would have theoretically resulted in the best resolving time, this would require moving the spatial aligned x or y position encoding to a different visual channel. It was judged that doing this would decrease the saliency of the position information more than the possible benefit because of how unintuitive it is to have spatial variables not encoded by position.
+ These two visual channels were chosen because according to the data type chart, area has a better than average resolving time for quantitative data and saturation has a slightly worse than average resolving time for quantitative data. It is believed that in conjunction, leveraging both visual channels will increase saliency of the quantitative VCAN expression count. Note that while VCAN expression count is the main quantitative variable and so encoding it with the x or y axis position would have theoretically resulted in the best resolving time, this would require moving the spatial aligned x or y position encoding to a different visual channel. It was judged that doing this would decrease the saliency of the position information more than the possible benefit because of how unintuitive it is to have spatial variables not encoded by position.

## What type of data visualization is this? What about the data are you trying to make salient through this data visualization? What Gestalt principles have you applied towards achieving this goal if any?
This data visualization is a scatter plot. My explanatory data visualization seeks to make more salient the relationship between VCAN expression count and the aligned x and y position of the spot. Because the x and y positions of the spot correspond to locations and structures in the tissue sample, this data visualization can make more salient the relationship between VCAN expression and certain locations and structures in the tissue. I have applied the Gestalt principle of enclosure to separate the size and color encoding legend from the plot with boxes. The Gestalt principle of proximity is also present in the legend because the size and color keys are next to each other. The Gestalt principle of similarity is used to identify areas of high or low VCAN gene expression count, since they will have similar color saturation. For example, the bottom right is a saturated dark blue and an area of high VCAN gene expression while the top left is an unsaturated gray and an area of low VCAN gene expression.
158 changes: 158 additions & 0 deletions _posts/2024-03-05-aying2.md
@@ -0,0 +1,158 @@
---
layout: post
title: "gganimate: Visualizing IGKC in tSNE Space with Non-linear Dimensionality Reduction on Varying Numbers of PCs"
author: Andrew Ying
jhed: aying2
categories: [ HW EC1 ]
image: homework/hwEC1/hwEC1_aying2.gif
featured: false
---

## What data types are you visualizing?
For the plots on the left side of the animation, I am visualizing the quantitative data of the X1 and X2 tSNE embedding values, and quantitative data of the IGKC expression for each spot.

For the plots on the right side of the animation, I am visualizing quantitative data of the standard deviation of the principal component, and ordinal data of the number assigned to the principal component.

## What data encodings are you using to visualize these data types?

For the plots on the left side of the animation, I am using the geometric primitive of points to represent each spot on the spatial gene expression slide. To encode the X1 embedding value, I am using the visual channel of position along the x axis. To encode the X2 value, I am using the visual channel of position along the y axis. To encode the quantitative IGKC expression, I am using the visual channel of saturation going from an unsaturated light grey to a saturated red.

For the plots on the right side of the animation, I am using the geometric primitive of points to represent each principal component. To encode the quantitative standard deviation of the principal component, I am using the visual channel of position along the x axis. To encode the ordinal number assigned to the principal component, I am using the visual channel of position along the y axis.

The visual channels were chosen because, according to the data type chart, position has the best resolving time, so it was used for the X1 and X2 embedding values, PC standard deviation, and PC ordinal number. Saturation was chosen to encode IGKC expression because it has a moderate resolving time for quantitative data and would not result in overlap between points in the tSNE plot, like area would.

## What type of data visualization is this? What about the data are you trying to make salient through this data visualization? What Gestalt principles have you applied towards achieving this goal if any?

The plots on the left side of the animation are scatterplots. The plots on the right side of the animation are line plots.

My explanatory data visualization seeks to make more salient the effect of the number of PCs used on the output of nonlinear dimensionality reduction in tSNE space. It also makes salient the IGKC expression for each spot in tSNE space, with the saturation channel used for IGKC expression helping viewers track various groups of spots across frames of the animation. The visualization also makes salient the standard deviation for each principal component, such that the viewer can gauge how the changes in the tSNE plot from using more PCs relates to the standard deviation of those PCs.
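The intuition behind sweeping the number of PCs can be sketched in base R alone. This toy example uses the built-in `iris` measurements rather than the eevee dataset (an assumption for illustration only), showing how the variance captured grows as more PCs are kept:

```r
# Toy sketch (base R, built-in iris data -- NOT the eevee dataset):
# how much of the total variance the first k PCs capture as k grows.
pcs_toy <- prcomp(scale(iris[, 1:4]))
var_explained <- pcs_toy$sdev^2 / sum(pcs_toy$sdev^2)
cumulative <- cumsum(var_explained)
# Each additional PC hands tSNE a more complete version of the signal
print(round(cumulative, 3))
```

Keeping progressively more PCs, as the animation does in powers of two, feeds tSNE a progressively more complete representation of the expression data.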

The Gestalt principles of proximity and similarity are present because the tSNE spatial plot and standard deviation using the same number of PCs are adjacent to each other throughout the animation. The Gestalt principle of continuity is used because the frames of the animation are in increasing order for the number of PCs used for tSNE and the standard deviation plot.

## Please share the code you used to reproduce this data visualization.
```{r}
data <- read.csv("genomic-data-visualization-2024/data/eevee.csv.gz",
                 row.names = 1)
data[1:10, 1:10]
pos <- data[, 2:3]
gexp <- data[, 4:ncol(data)]
# from lesson 5
topgene <- names(sort(apply(gexp, 2, var), decreasing = TRUE)[1:1000])
gexpfilter <- gexp[, topgene]
# code taken from Dr. Fan's code-lesson-5.R
# gexpnorm <- log10(gexpfilter/rowSums(gexpfilter) * mean(rowSums(gexpfilter))+1)
gexpnorm <- log10(gexp/rowSums(gexp) * mean(rowSums(gexp))+1)
# ?prcomp
pcs <- prcomp(gexpnorm)
plot(pcs$sdev, type = 'o')
library(ggplot2)
ggplot(data.frame(pos, gexpnorm)) +
  scale_colour_gradient(low = 'lightgrey', high = 'darkred') +
  geom_point(aes(x = aligned_x, y = aligned_y, color = IGKC)) +
  theme_minimal()
library(Rtsne)
# ?Rtsne
nsteps <- ceiling(log2(length(pcs$sdev)))
sdev_plts <- list()
tsne_plts <- list()
df_anim <- data.frame()
df_sdev <- data.frame()
for (i in 1:nsteps) {
  npcs <- 2^i
  s <- ""
  if (npcs > length(pcs$sdev)) {
    npcs <- length(pcs$sdev)
    s <- "all "
  }
  print(paste(i, npcs))
  sdev_df <- data.frame(sdev = pcs$sdev[1:npcs])
  sdev_df$PC <- as.numeric(rownames(sdev_df))
  # "group" (not "grouping") is the ggplot2 aesthetic that connects the points
  sdev_plts[[i]] <- ggplot(sdev_df, aes(x = PC, y = sdev, group = 1)) +
    geom_point() + geom_line() +
    labs(title = sprintf('sdev vs. PC (%d PCs total)', npcs)) +
    theme_bw()
  set.seed(42)
  emb <- Rtsne(pcs$x[, 1:npcs])$Y
  df <- data.frame(emb, gexpnorm)
  df_anim <- rbind(df_anim, cbind(df, npcs = npcs))
  df_sdev <- rbind(df_sdev, cbind(sdev_df, npcs = npcs))
  tsne_plts[[i]] <- ggplot(df) +
    geom_point(aes(x = X1, y = X2, color = IGKC)) +
    scale_colour_gradient(low = 'lightgrey', high = 'darkred') +
    labs(title = sprintf('IGKC vs. X2 vs. X1 (tSNE on %s%d PCs)', s, npcs)) +
    theme_bw()
}
sdev_plts[[1]]
tsne_plts[[10]]
library(gganimate)
library(magick) # needed for magick_renderer(), image_append(), image_write()
main_plt <- ggplot(df_anim) +
  geom_point(aes(x = X1, y = X2, color = IGKC)) +
  scale_colour_gradient(low = 'lightgrey', high = 'darkred') +
  labs(title = 'IGKC vs. X2 vs. X1 (tSNE on {closest_state} PCs)') +
  theme_bw()
main_anim <- main_plt +
  transition_states(npcs,
                    state_length = 2,
                    transition_length = 1) +
  ease_aes('sine-in-out')
sdev_plt <- ggplot(df_sdev, aes(x = PC, y = sdev, group = 1)) +
  geom_point() + geom_line() +
  labs(title = 'sdev vs. PC ({closest_state} PCs total)') +
  theme_bw()
sdev_anim <- sdev_plt +
  transition_states(npcs,
                    state_length = 2,
                    transition_length = 1) +
  ease_aes('sine-in-out')
main_gif <- animate(main_anim, renderer = magick_renderer())
sdev_gif <- animate(sdev_anim, renderer = magick_renderer())
i <- 1
new_gif <- image_append(c(main_gif[i], sdev_gif[i]))
for (i in 2:100) {
  combined <- image_append(c(main_gif[i], sdev_gif[i]))
  new_gif <- c(new_gif, combined)
}
new_gif
image_write(new_gif, "aying2.gif")
```
116 changes: 116 additions & 0 deletions _posts/2024-03-06-aying2.md
@@ -0,0 +1,116 @@
---
layout: post
title: "Analyzing Connectivity of Spleen CODEX dataset using CRAWDAD on K-means Clusters"
author: Andrew Ying
jhed: aying2
categories: [ HW EC3 ]
image: homework/hwEC3/hwEC3_aying2.png
featured: false
---

## What data types are you visualizing?

For plot A, I am visualizing the spatial data of the x and y positions for each cell, and categorical data of the cluster the cell belongs to. I am using the geometric primitive of points to represent each cell. To encode the spatial x position, I am using the visual channel of position along the x axis. To encode the spatial y position, I am using the visual channel of position along the y axis. To encode the categorical cluster the cell belongs to, I am using the visual channel of hue.

For plot B, I am visualizing the quantitative data of the X1 and X2 tSNE embedding values, and categorical data of the cluster the cell belongs to. I am using the geometric primitive of points to represent each cell. To encode the X1 embedding value, I am using the visual channel of position along the x axis. To encode the X2 value, I am using the visual channel of position along the y axis. To encode the categorical cluster the cell belongs to, I am using the visual channel of hue.

For plot C, I am visualizing the categorical data of the neighbor cluster using the y axis position. I am visualizing the categorical data of the reference cluster using the x axis position. I am visualizing the quantitative data of the z-score using the visual channel of color hue. I am visualizing the quantitative data of the scale using the visual channel of area.

For plots D and E, I am visualizing the quantitative data of the z-score using the visual channel of y axis position. I am visualizing the quantitative data of the scale using the visual channel of x axis position.

The Gestalt principles of similarity and proximity are applied: plots D and E are both line plots and are placed adjacent to each other, and plots A and B, which share the same color scheme, are adjacent.


## Please share the code you used to reproduce this data visualization.
```{r}
data <- read.csv("genomic-data-visualization-2024/data/codex_spleen_subset.csv.gz",
                 row.names = 1)
data[1:10, 1:10]
pos <- data[, 1:2]
area <- data[, 3]
pexp <- data[, 4:ncol(data)]
pexpnorm <- log10(pexp/area * mean(area) + 1)
library(ggplot2)
ggplot(data.frame(pos, area)) +
  geom_point(aes(x = x, y = y, col = area)) +
  scale_color_gradient(high = "darkred", low = "gray")
library(Rtsne)
set.seed(42)
emb <- Rtsne(pexpnorm, perplexity=15)$Y
set.seed(42)
tw <- sapply(1:15, function(i) {
  print(i)
  kmeans(pexpnorm, centers = i, iter.max = 50)$tot.withinss
})
plot(tw, type='o')
set.seed(42)
com <- as.factor(kmeans(pexpnorm, centers=7)$cluster)
p1 <- ggplot(data.frame(pos, pexpnorm, com)) +
  geom_point(aes(x = x, y = y, col = com), size = 1)
p2 <- ggplot(data.frame(emb, pexpnorm, com)) +
  geom_point(aes(x = X1, y = X2, col = com), size = 1)
## https://github.com/JEFworks-Lab/CRAWDAD/blob/main/docs/3_spleen.md
library(crawdad)
crawdad_df <- data.frame(x = pos[,1], y = pos[,2], com)
ncores <- 8
set.seed(42)
## convert to an sf object (crawdad's internal spatial format);
## renamed from `seq` to `cells` to avoid shadowing base::seq()
cells <- crawdad:::toSF(pos = crawdad_df[, c("x", "y")],
                        celltypes = crawdad_df$com)
set.seed(42)
## generate background
shuffle.list <- crawdad::makeShuffledCells(cells,
                                           scales = seq(100, 1000, by = 100),
                                           perms = 3,
                                           ncores = ncores,
                                           seed = 1,
                                           verbose = TRUE)
set.seed(42)
## find trends, passing background as parameter
results <- crawdad::findTrends(cells,
                               dist = 50,
                               shuffle.list = shuffle.list,
                               ncores = ncores,
                               verbose = TRUE,
                               returnMeans = FALSE) # for error bars
set.seed(42)
## convert results to data.frame
dat <- crawdad::meltResultsList(results, withPerms = T)
## multiple-test correction
ntests <- length(unique(dat$reference)) * length(unique(dat$neighbor)) # one test per reference-neighbor pair
psig <- 0.05/ntests # bonferroni correction
zsig <- round(qnorm(psig/2, lower.tail = F), 2)
p3 <- vizColocDotplot(dat, reorder = FALSE, zsig.thresh = zsig,
                      zscore.limit = zsig * 2, dot.sizes = c(6, 20)) +
  theme(legend.position = 'right',
        axis.text.x = element_text(angle = 45, h = 0))
p3
library(tidyverse)
dat_filter <- dat %>%
  filter(reference == '2') %>%
  filter(neighbor == '6')
p4 <- vizTrends(dat_filter, lines = T, withPerms = T, sig.thresh = zsig)
dat_filter <- dat %>%
  filter(reference == '1') %>%
  filter(neighbor == '2')
p5 <- vizTrends(dat_filter, lines = T, withPerms = T, sig.thresh = zsig)
library(patchwork)
p1 + p2 + p3 + p4 + p5 + plot_annotation(tag_levels = 'A') + plot_layout(nrow = 2, ncol = 3)
```
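The Bonferroni-corrected z threshold used above can be checked in isolation. This base-R sketch assumes a hypothetical 7 x 7 grid of reference-neighbor cluster pairs (the cluster count is an assumption for illustration, chosen to match `centers = 7`):

```r
# Sketch of the Bonferroni correction step (hypothetical 7 clusters)
ntests <- 7 * 7                      # every reference-neighbor pair
psig <- 0.05 / ntests                # corrected significance level
# two-sided z threshold corresponding to the corrected level
zsig <- round(qnorm(psig / 2, lower.tail = FALSE), 2)
print(c(psig = psig, zsig = zsig))
```

Because 49 tests are run, the threshold rises well above the familiar 1.96 for a single two-sided test, making the dot plot's significance calls more conservative.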
Binary file added homework/hwEC1/hwEC1_aying2.gif
Binary file added homework/hwEC3/hwEC3_aying2.png
