Investigate build error (#12)
* - removed psych
- added marginal plots

* updated travis link
added news text
o1iv3r authored Dec 28, 2020
1 parent fc5a7a6 commit 4f0fadd
Showing 4 changed files with 27 additions and 35 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: ClustImpute
Type: Package
Title: K-means clustering with built-in missing data imputation
Version: 0.1.6
Version: 0.1.7
Author: Oliver Pfaffel
Maintainer: Oliver Pfaffel <[email protected]>
Description: This clustering algorithm deals with missing data via weights that are imposed on missing values and successively increased. See the vignette for details.
@@ -15,7 +15,7 @@ Imports:
magrittr,
rlang
Suggests:
psych,
ggExtra,
ggplot2,
knitr,
rmarkdown,
5 changes: 5 additions & 0 deletions NEWS.md
@@ -1,3 +1,8 @@
# ClustImpute 0.1.7

* Removed dependency from psych package.
* Added marginal plots to the vignette via ggExtra.

# ClustImpute 0.1.6

* Added vignette that describes the algorithm in more detail.
2 changes: 1 addition & 1 deletion README.md
@@ -3,7 +3,7 @@
# <img src="man/figures/logo.png" align="right" width="90" />

<!-- badges: start -->
[![Travis build status](https://travis-ci.org/o1iv3r/ClustImpute.svg?branch=master)](https://travis-ci.org/o1iv3r/ClustImpute)
[![Travis build status](https://travis-ci.com/o1iv3r/ClustImpute.svg?branch=master)](https://travis-ci.com/o1iv3r/ClustImpute)
[![Codecov test coverage](https://codecov.io/gh/o1iv3r/ClustImpute/branch/master/graph/badge.svg)](https://codecov.io/gh/o1iv3r/ClustImpute?branch=master)
![CRAN_Version](https://www.r-pkg.org/badges/version-last-release/ClustImpute)
![CRAN_Downloads](https://cranlogs.r-pkg.org/badges/grand-total/ClustImpute)
51 changes: 19 additions & 32 deletions vignettes/Example_on_simulated_data.Rmd
@@ -43,12 +43,17 @@ dat<- as.data.frame(scale(dat)) # scaling
summary(dat)
```

One can clearly see the three clusters
One can clearly see the three clusters of the randomly generated data:

```{r}
plot(dat$x,dat$y)
library(ggExtra)
dat4plot <- dat
dat4plot$true_clust_fct <- factor(true_clust)
p_base <- ggplot(dat4plot,aes(x=x,y=y,color=true_clust_fct)) + geom_point()
ggMarginal(p_base, groupColour = TRUE, groupFill = TRUE)
```


We create 20% missing values using a custom function:

```{r}
@@ -67,37 +72,39 @@ corrplot(cor(mis_ind),method="number")
# Median or random imputation


Clearly, an imputation with the median value does a pretty bad job here:
Clearly, an imputation with the median value does a pretty bad job here. All imputed values lie on one of the two axes, thereby completely distorting the marginal distributions:

```{r}
dat_median_imp <- dat_with_miss
for (j in 1:dim(dat)[2]) {
dat_median_imp[,j] <- Hmisc::impute(dat_median_imp[,j],fun=median)
}
imp <- factor(pmax(mis_ind[,5],mis_ind[,6]),labels=c("Original","Imputed")) # point is imputed if x or y is imputed
ggplot(dat_median_imp) + geom_point(aes(x=x,y=y,color=imp))
p_median_imp <- ggplot(dat_median_imp) + geom_point(aes(x=x,y=y,color=imp))
ggMarginal(p_median_imp,groupColour = TRUE, groupFill = TRUE)
```


But also a random imputation is not much better: it creates plenty of points in areas with no data
But a random imputation is not much better either: it creates plenty of points in areas with no data. Note how a simple check of the marginal distributions alone would not reveal this issue!

```{r}
dat_random_imp <- dat_with_miss
for (j in 1:dim(dat)[2]) {
dat_random_imp[,j] <- impute(dat_random_imp[,j],fun="random")
}
imp <- factor(pmax(mis_ind[,5],mis_ind[,6]),labels=c("Original","Imputed")) # point is imputed if x or y is imputed
ggplot(dat_random_imp) + geom_point(aes(x=x,y=y,color=imp))
p_random_imp <- ggplot(dat_random_imp) + geom_point(aes(x=x,y=y,color=imp))
ggMarginal(p_random_imp,groupColour = TRUE, groupFill = TRUE)
```

A cluster base on random imputation will thus not provide good results (even if we "know" the number of clusters)
A clustering based on random imputation will thus not provide good results (even if we "know" the number of clusters, which is 3 in this case). Note how the marginal distribution of y differs from the first chart of this vignette, where we show the true clusters instead of the clusters predicted after a random imputation.

```{r}
tic("Clustering based on random imputation")
cl_compare <- KMeans_arma(data=dat_random_imp,clusters=3,n_iter=100,seed=751)
toc()
dat_random_imp$pred <- predict_KMeans(dat_random_imp,cl_compare)
ggplot(dat_random_imp) + geom_point(aes(x=x,y=y,color=factor(pred)))
p_random_imp <- ggplot(dat_random_imp) + geom_point(aes(x=x,y=y,color=factor(pred)))
ggMarginal(p_random_imp,groupColour = TRUE, groupFill = TRUE)
```


@@ -131,10 +138,11 @@ ClustImpute provides several results:
str(res)
```

We'll first look at the complete data and clustering results. Quite obviously, it gives better results then median / random imputation.
We'll first look at the complete data and clustering results. Quite obviously, it gives better results than median / random imputation.

```{r}
ggplot(res$complete_data,aes(x,y,color=factor(res$clusters))) + geom_point()
p_clustimpute <- ggplot(res$complete_data,aes(x,y,color=factor(res$clusters))) + geom_point()
ggMarginal(p_clustimpute,groupColour = TRUE, groupFill = TRUE)
```

Packages like MICE compute a traceplot of mean and variance for various chains. Here we only have a single realization and thus re-run ClustImpute with various seeds to obtain different realizations.
@@ -163,27 +171,6 @@ ggplot(as.data.frame(sd_all)) + geom_line(aes(x=iter,y=V1,color=factor(seed))) +

# Quality of imputation and cluster results

## Marginal distributions

We compare marginal distributions using a violin plot of x and y

```{r}
dat4plot <- dat
dat4plot$true_clust <- true_clust
Xfinal <- res$complete_data
Xfinal$pred <- res$clusters
par(mfrow=c(1,2))
violinBy(dat4plot,"x","true_clust",main="Original data")
violinBy(dat4plot,"y","true_clust",main="Original data")
violinBy(Xfinal,"x","pred",main="imputed data")
violinBy(Xfinal,"y","pred",main="imputed data")
violinBy(dat_random_imp,"x","pred",main="random imputation")
violinBy(dat_random_imp,"y","pred",main="random imputation")
```

In particular for y the distribution by cluster is quite far away from the original distribution for the random imputation based clustering.

## External validation: rand index

Below we compare the rand index between true and fitted cluster assignment. For all cases we obtain
