---
title: "Coronal Holes data features extraction"
author: "Dmitry A. Grechka"
date: "May 27, 2016"
output:
  html_document:
    toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_chunk$set(cache = TRUE)
```
# Intro
I am going to perform data cleanup and feature extraction for solar wind model fitting.
Coronal hole characteristics are considered to be the major predictors of the solar wind (e.g. see <https://www.cfa.harvard.edu/~scranmer/Preprints/cranmer_y10pre.pdf>).
# Available data sets
I've got two CSV data sets to be used as training data for the modelling (see the acknowledgments section for the data origin).
Copies of the files are available at <https://github.com/dgrechka/SolarWindVelocity/tree/master/SampleData>.
They contain quantitative features extracted from images of the Sun using computer vision algorithms.
The files are as follows:

* ch_2015_0193.dsv - features derived from the green portion of the spectrum
* ch_2015_0211.dsv - features derived from the red portion of the spectrum

There are also blue spectrum (171) images, but no features have been extracted from them.
First, we will analyse the features extracted from the green images (ch_2015_0193.dsv).
```{r data load}
ch_2015_0193 <- read.csv('SampleData/ch_2015_0193.dsv',
                         sep = '\t',
                         dec = ',',
                         colClasses = c("dt_record" = "character"))
```
We need to parse the time field into a numeric axis. I will use the offset in hours from the beginning of the year.
```{r time_parsing}
# the reference point for the time axis: 2015-01-01 00:00 UTC
base_ts_sec <- as.numeric(strptime("2015-01-01T00:00:00","%Y-%m-%dT%H:%M:%S",tz='UTC'))
ch_2015_0193$dt_record <- strptime(ch_2015_0193$dt_record,"%Y-%m-%dT%H:%M:%S",tz='UTC')
ch_2015_0193$ts_sec <- as.numeric(ch_2015_0193$dt_record)
ch_2015_0193$ts <- (ch_2015_0193$ts_sec - base_ts_sec)/3600 # hours since base time
```
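As a quick sanity check (not evaluated here), one can inspect the first few parsed timestamps alongside their hour offsets:
```{r time_parsing_check, eval=FALSE}
# ts should grow by roughly the cadence of the observations
head(ch_2015_0193[, c('dt_record', 'ts')], 3)
```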
# Data exploration
We will use the following libraries during the analysis:
```{r dependencies,message=FALSE}
library(caret)
library(oce)
library(pracma)
library(corrplot)
library(ggplot2)
```
Setting the random seed for reproducibility.
```{r repro}
set.seed(5)
```
## Overview
We have a total of **`r ncol(ch_2015_0193)-3`** possible predictors.
```{r data_overview}
str(ch_2015_0193)
```
## Data cleanup
### Low-variance features
We will exclude variables that show little or no variation.
```{r near_zero_vars}
nzv_stats <- nearZeroVar(ch_2015_0193,saveMetrics = T)
```
The following variables have near-zero variance:
```{r near_zero_vars2}
nzv_stats[nzv_stats$nzv,]
```
We keep only the varying variables for further analysis.
```{r near_zero_vars3}
nzv <- nearZeroVar(ch_2015_0193,saveMetrics = F)
# keep only the columns that were not flagged as near-zero-variance
ch_2015_0193 <- ch_2015_0193[,index2vec(nzv,vars=ncol(ch_2015_0193)) != 1]
```
### Missing Values handling
For this study we will simply omit the observations with missing values.
Before omitting there were **`r nrow(ch_2015_0193)`** observations.
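Before dropping rows, it can be useful to see which columns actually contain the missing values; a minimal check (not evaluated here):
```{r na_overview, eval=FALSE}
# number of missing values per column, largest first
sort(sapply(ch_2015_0193, function(col) sum(is.na(col))), decreasing = TRUE)
```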
```{r missing values}
ch_2015_0193 <- na.omit(ch_2015_0193)
```
After omitting there are **`r nrow(ch_2015_0193)`** observations left.
## Exploring the correlations
```{r correlations}
# omitting the time axes and origin file names
numeric_vars <- colnames(ch_2015_0193[,!(names(ch_2015_0193) %in% c('fitsname','dt_record','ts_sec','ts'))])
corM <- cor(ch_2015_0193[,names(ch_2015_0193) %in% numeric_vars],use = 'complete.obs')
varMcol <- colorRampPalette(c("red", "white", "blue"))(20)
heatmap(x = corM, col = varMcol, symm = TRUE)
#corrplot(corM,method = 'pie',order = 'hclust')
```
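Here we only inspect the correlation structure visually. If one wanted to prune highly correlated predictors automatically, caret's `findCorrelation` could be applied to the same matrix (a sketch, not evaluated here; the 0.9 cutoff is an arbitrary choice):
```{r find_correlated, eval=FALSE}
# indices of columns suggested for removal due to pairwise correlation > 0.9
highCor <- findCorrelation(corM, cutoff = 0.9)
colnames(corM)[highCor]
```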
## Exploring value densities
```{r densities plots}
par(mfrow=c(1,2))
for(name in numeric_vars) {
  plot(density(ch_2015_0193[[name]],na.rm=T),main = paste('Density of',name))
}
```
## Data transforms
### Log transforms
Some of the variables are highly skewed. We will log-transform these variables.
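The set of variables below was picked by visually inspecting the density plots above; a programmatic alternative would be to rank the variables by sample skewness (a sketch assuming the `e1071` package, not evaluated here):
```{r skewness_check, eval=FALSE}
# strongly positive skewness marks candidates for a log transform
skews <- sapply(numeric_vars, function(n) e1071::skewness(ch_2015_0193[[n]], na.rm = TRUE))
sort(skews, decreasing = TRUE)
```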
```{r transforms}
varsToLogNames <- c('CHArea','CHMaxIntensity','CHVarIntensity','CHPerimeter','CorrectAlongLat','CorrectAlongLon','RightSideCHMaxInt','AboveCHMaxInt','BelowCHMaxInt','RightSideCHMeanInt','BelowCHMeanInt')
for(name in varsToLogNames) {
  par(mfrow=c(1,2))
  ch_2015_0193[[paste0('log_',name)]] <- log(ch_2015_0193[[name]])
  numeric_vars <- c(numeric_vars,paste0('log_',name))
  plot(density(ch_2015_0193[[name]],na.rm=T),main ='Before',ylab=paste('Density of',name))
  plot(density(ch_2015_0193[[paste0('log_',name)]],na.rm=T),main ='After',ylab=paste0('Density of log_',name))
}
ch_2015_0193 <- ch_2015_0193[,!(names(ch_2015_0193) %in% varsToLogNames)]
numeric_vars <- numeric_vars[!(numeric_vars %in% varsToLogNames)]
```
## Exploring time series
### Subsetting the training dataset for visual exploration
For basic visual time-series exploration we will investigate the first two months of data.
```{r shrinking}
exploratoryOffsetThreshold <- 1440 # hours (the first two months)
ch_2015_0193_short <- ch_2015_0193[ch_2015_0193$ts <= exploratoryOffsetThreshold,]
```
### Noisy variables
Some of the variables are consistent in time, while others contain a lot of noise.
We consider the following variables too noisy and will not use them as features for now.
```{r noisy_plots}
consistent_variables <- c('ImageMeanValue','SunMeanIntensity','SunVarIntensity','Total_Nubmers_of_CH','Total_Area_CH','Relative_CH_Area','Total_CorrectSphereArea_CH','Relative_CH_CorrectSphereArea')
noisy_vars <- numeric_vars[!(numeric_vars %in% consistent_variables)]
for(name in noisy_vars) {
  print(
    ggplot(ch_2015_0193_short, aes_string('ts', name)) +
      geom_point(shape=1)
  )
}
```
### Consistent variables
We transform the consistent variables to reduce noise: spikes are removed with a running median, assuming they are caused by inconsistencies in the computer vision analysis of the Sun images.
```{r runmed}
for(name in consistent_variables) {
  variable <- ch_2015_0193_short[[name]]
  variable <- runmed(variable, k=25) # running median over a 25-observation window
  ch_2015_0193_short[[paste0(name,'_m')]] <- variable
}
```
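The window width k=25 is a fixed choice here; a quick way to sanity-check it is to overlay several window widths on a single variable (a sketch, not evaluated here):
```{r runmed_window_check, eval=FALSE}
name <- 'SunMeanIntensity'
plot(ch_2015_0193_short$ts, ch_2015_0193_short[[name]],
     pch = 1, cex = 0.5, xlab = 'ts (hours)', ylab = name)
for (k in c(5, 25, 101)) { # runmed requires an odd window width
  lines(ch_2015_0193_short$ts, runmed(ch_2015_0193_short[[name]], k = k))
}
```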
## Consistent variable correlations
```{r correlations2}
consistent_variables_approx <- paste0(consistent_variables, '_m')
corM <- cor(ch_2015_0193_short[,names(ch_2015_0193_short) %in% consistent_variables_approx],use = 'complete.obs')
corrplot(corM,method = 'pie',order = 'hclust')
```
As *Total_Area_CH_m* is fully correlated with *Relative_CH_Area_m*, and *Total_CorrectSphereArea_CH_m* with *Relative_CH_CorrectSphereArea_m*, we will keep only the *Relative* variables.
```{r relative_consistent}
consistent_variables <- consistent_variables[!(consistent_variables %in% c('Total_Area_CH','Total_CorrectSphereArea_CH'))]
```
Finally, we have **`r length(consistent_variables)`** predictors that can be used for model fitting (the time series are visualized later).
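For reference, the remaining predictor names can be listed (not evaluated here):
```{r final_predictors, eval=FALSE}
# the predictor set the rest of the analysis relies on
consistent_variables
```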
# Conclusion
## Processing the other spectrum's data
We declare a function that performs all of the actions applied to the initial data set.
```{r preprocessing}
chPreprocess <- function (data) {
  data2 <- data
  # creating the time axis
  data2$dt_record <- strptime(data2$dt_record,"%Y-%m-%dT%H:%M:%S",tz='UTC')
  data2$ts_sec <- as.numeric(data2$dt_record)
  data2$ts <- (data2$ts_sec - base_ts_sec)/3600 # hours since base time
  data2 <- data2[,names(data2) %in% c('ts',consistent_variables)]
  # filtering out missing values
  data2 <- na.omit(data2)
  # applying a running median to approximate the signal and remove spikes
  result <- data.frame(ts=data2$ts)
  for(name in consistent_variables) {
    variable <- data2[[name]]
    result[[name]] <- variable
    variable_m <- runmed(variable, k=25)
    result[[paste0(name,'_m')]] <- variable_m
  }
  return (result)
}
```
Applying this function to the data derived from the red images.
```{r 211_processing}
ch_2015_0211 <- read.csv('SampleData/ch_2015_0211.dsv',
                         sep = '\t',
                         dec = ',',
                         colClasses = c("dt_record" = "character"))
ch_2015_0211 <- chPreprocess(ch_2015_0211)
ch_2015_0211_short <- ch_2015_0211[ch_2015_0211$ts <= exploratoryOffsetThreshold,]
```
We average the properties of the two spectra: cubic smoothing splines are fitted through the smoothed variables of each spectrum, and the average of the two splines is computed on a common hourly grid. The smoothing splines both interpolate the two series to shared knots and further suppress residual noise.
```{r joining_predictors}
chJoin <- function(data193,data211) {
  left <- min(data193$ts,data211$ts)
  right <- max(data193$ts,data211$ts)
  knots <- seq(from = left, to = right, by = 1) # common hourly grid
  results <- data.frame(ts=knots)
  for(name in consistent_variables) {
    name_m <- paste0(name,'_m')
    sp1 <- smooth.spline(data193$ts,data193[[name_m]])
    sp2 <- smooth.spline(data211$ts,data211[[name_m]])
    synth_1 <- predict(sp1, knots)$y
    synth_2 <- predict(sp2, knots)$y
    synth_avg <- (synth_1+synth_2)*0.5
    results[[paste0(name,'_spline_193')]] <- synth_1
    results[[paste0(name,'_spline_211')]] <- synth_2
    results[[paste0(name,'_j')]] <- synth_avg
  }
  return(results)
}
joint_short <- chJoin(ch_2015_0193_short,ch_2015_0211_short)
```
## Exploring results
The resulting variables are plotted below.
```{r variation in time}
varsToAvg <- c('Total_Nubmers_of_CH','Relative_CH_Area','Relative_CH_CorrectSphereArea')
color_scale <- c('col_193_apx'='red','col_211_apx'='green','col_avg'='blue')
color_scale_labels <- c('col_193_apx'='approximated (193nm)','col_211_apx'='approximated (211nm)','col_avg'='average')
shape_scale <- c('shape_193_raw'=1,'shape_211_raw'=3)
shape_scale_labels <- c('shape_193_raw'='raw data (193nm)','shape_211_raw'='raw data (211nm)')
for(name in consistent_variables) {
  p <- ggplot(ch_2015_0193_short) +
    scale_colour_manual(values=color_scale,labels=color_scale_labels)+
    scale_shape_manual(values=shape_scale,labels=shape_scale_labels)+
    geom_point(aes_string('ts',name,shape='"shape_193_raw"'))+
    geom_point(data=ch_2015_0211_short,aes_string('ts',name,shape='"shape_211_raw"'))+
    geom_line(aes_string('ts',paste0(name,'_m'),colour='"col_193_apx"'),lwd=1)+
    geom_line(data=ch_2015_0211_short,aes_string('ts',paste0(name,'_m'),colour='"col_211_apx"'),lwd=0.5)
  if(name %in% varsToAvg) {
    p <- p + geom_line(data=joint_short,aes_string('ts',paste0(name,'_j'),colour='"col_avg"'),lwd=1)
  }
  print(p)
}
```
## Saving cleaned, transformed data
Processing and visualizing the whole-year data (reloading the 193 data set first).
```{r outputing results}
ch_2015_0193 <- read.csv('SampleData/ch_2015_0193.dsv',
                         sep = '\t',
                         dec = ',',
                         colClasses = c("dt_record" = "character"))
ch_2015_0193 <- chPreprocess(ch_2015_0193)
joint <- chJoin(ch_2015_0193,ch_2015_0211)
summary(joint)
nrow(joint)
for(name in consistent_variables) {
  p <- ggplot(ch_2015_0193) +
    scale_colour_manual(values=color_scale,labels=color_scale_labels)+
    scale_shape_manual(values=shape_scale,labels=shape_scale_labels)+
    geom_point(aes_string('ts',name,shape='"shape_193_raw"'))+
    geom_point(data=ch_2015_0211,aes_string('ts',name,shape='"shape_211_raw"'))+
    geom_line(aes_string('ts',paste0(name,'_m'),colour='"col_193_apx"'),lwd=1)+
    geom_line(data=ch_2015_0211,aes_string('ts',paste0(name,'_m'),colour='"col_211_apx"'),lwd=0.5)
  if(name %in% varsToAvg) {
    p <- p + geom_line(data=joint,aes_string('ts',paste0(name,'_j'),colour='"col_avg"'),lwd=1)
  }
  print(p)
}
```
Exporting the whole-year data.
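Note that `write.csv` fails if the output directory does not exist; it can be created first if needed (not evaluated here, assuming the working directory is the repository root):
```{r ensure_output_dir, eval=FALSE}
if (!dir.exists('ResultData')) dir.create('ResultData')
```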
```{r export}
write.csv(file='ResultData/CH_features_0211_cleaned_2015.csv', row.names = F,
          ch_2015_0211)
write.csv(file='ResultData/CH_features_0193_cleaned_2015.csv', row.names = F,
          ch_2015_0193)
write.csv(file='ResultData/CH_features_cleaned_2015.csv', row.names = F,
          data.frame(
            ts=joint$ts,
            Relative_CH_CorrectSphereArea_j=joint$Relative_CH_CorrectSphereArea_j,
            Relative_CH_CorrectSphereArea_spline_211=joint$Relative_CH_CorrectSphereArea_spline_211
          ))
```
This report can be generated by processing the Rmd file published at <https://github.com/dgrechka/SolarWindVelocity>.
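For example, from the repository root (assuming the `rmarkdown` package is installed):
```{r render_report, eval=FALSE}
rmarkdown::render('CH_features_extraction.Rmd')
```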
# Acknowledgments
I want to thank the Skobeltsyn Institute of Nuclear Physics of Moscow State University <http://swx.sinp.msu.ru>, which publishes space weather observation data, and especially Lucy Mukhametdinova from SINP MSU, who provided the data in a convenient CSV form.