---
title: "Coronal Holes data features extraction"
author: "Dmitry A. Grechka"
date: "May 27, 2016"
output:
  html_document:
    toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_chunk$set(cache = TRUE)
```
# Intro
I am going to perform data cleanup and feature extraction for solar wind model fitting.
Coronal hole characteristics are considered to be the major predictors of the solar wind (e.g. see <https://www.cfa.harvard.edu/~scranmer/Preprints/cranmer_y10pre.pdf>).
# Available data sets
I've got two CSV data sets to be used as training data for the modelling (see the acknowledgments section for the data origin).
Copies of the files are available at <https://github.com/dgrechka/SolarWindVelocity/tree/master/SampleData>.
They contain quantitative features extracted from images of the Sun using computer vision algorithms.
The files are as follows:

* ch_2015_0193.dsv - features derived from the green portion of the spectrum
* ch_2015_0211.dsv - features derived from the red portion of the spectrum

There are also blue spectrum (171) images, but no features have been extracted from them.
First, we will analyse the features extracted from the green images (ch_2015_0193.dsv).
```{r data load}
ch_2015_0193 <- read.csv('SampleData/ch_2015_0193.dsv',
                         sep = '\t',
                         dec = ',',
                         colClasses = c("dt_record" = "character"))
```
We need to parse the time field into a numeric axis. I will use the offset in hours from the beginning of the year.
```{r time_parsing}
# the reference point for the time axis: 2015-01-01 00:00 UTC
base_ts_sec <- as.numeric(strptime("2015-01-01T00:00:00","%Y-%m-%dT%H:%M:%S",tz='UTC'))
ch_2015_0193$dt_record <- strptime(ch_2015_0193$dt_record,"%Y-%m-%dT%H:%M:%S",tz='UTC')
ch_2015_0193$ts_sec <- as.numeric(ch_2015_0193$dt_record)
ch_2015_0193$ts <- (ch_2015_0193$ts_sec - base_ts_sec)/3600 # hours since base time
```
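As a quick sanity check (not evaluated here), one can inspect the first few parsed timestamps alongside their hour offsets:
```{r time_parsing_check, eval=FALSE}
# ts should grow by roughly the cadence of the observations
head(ch_2015_0193[, c('dt_record', 'ts')], 3)
```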
# Data exploration
We will use the following libraries during the analysis:
```{r dependencies,message=FALSE}
library(caret)
library(oce)
library(pracma)
library(corrplot)
library(ggplot2)
```
Setting the random seed for reproducibility.
```{r repro}
set.seed(5)
```
## Overview
We have a total of **`r ncol(ch_2015_0193)-3`** possible predictors.
```{r data_overview}
str(ch_2015_0193)
```
## Data cleanup
### Low-variance features
We will exclude variables that show little or no variation.
```{r near_zero_vars}
nzv_stats <- nearZeroVar(ch_2015_0193,saveMetrics = T)
```
The following variables have near-zero variance:
```{r near_zero_vars2}
nzv_stats[nzv_stats$nzv,]
```
We keep only the varying variables for further analysis.
```{r near_zero_vars3}
nzv <- nearZeroVar(ch_2015_0193,saveMetrics = F)
# keep only the columns that were not flagged as near-zero-variance
ch_2015_0193 <- ch_2015_0193[,index2vec(nzv,vars=ncol(ch_2015_0193)) != 1]
```
### Missing Values handling
For this study we will simply omit the observations with missing values.
Before omitting there were **`r nrow(ch_2015_0193)`** observations.
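Before dropping rows, it can be useful to see which columns actually contain the missing values; a minimal check (not evaluated here):
```{r na_overview, eval=FALSE}
# number of missing values per column, largest first
sort(sapply(ch_2015_0193, function(col) sum(is.na(col))), decreasing = TRUE)
```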
```{r missing values}
ch_2015_0193 <- na.omit(ch_2015_0193)
```
After omitting there are **`r nrow(ch_2015_0193)`** observations left.
## Exploring the correlations
```{r correlations}
# omitting the time axes and origin file names
numeric_vars <- colnames(ch_2015_0193[,!(names(ch_2015_0193) %in% c('fitsname','dt_record','ts_sec','ts'))])
corM <- cor(ch_2015_0193[,names(ch_2015_0193) %in% numeric_vars],use = 'complete.obs')
varMcol <- colorRampPalette(c("red", "white", "blue"))(20)
heatmap(x = corM, col = varMcol, symm = TRUE)
#corrplot(corM,method = 'pie',order = 'hclust')
```
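Here we only inspect the correlation structure visually. If one wanted to prune highly correlated predictors automatically, caret's `findCorrelation` could be applied to the same matrix (a sketch, not evaluated here; the 0.9 cutoff is an arbitrary choice):
```{r find_correlated, eval=FALSE}
# indices of columns suggested for removal due to pairwise correlation > 0.9
highCor <- findCorrelation(corM, cutoff = 0.9)
colnames(corM)[highCor]
```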
## Exploring value densities
```{r densities plots}
par(mfrow=c(1,2))
for(name in numeric_vars) {
  plot(density(ch_2015_0193[[name]],na.rm=T),main = paste('Density of',name))
}
```
## Data transforms
### Log transforms
Some of the variables are highly skewed. We will log-transform these variables.
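The set of variables below was picked by visually inspecting the density plots above; a programmatic alternative would be to rank the variables by sample skewness (a sketch assuming the `e1071` package, not evaluated here):
```{r skewness_check, eval=FALSE}
# strongly positive skewness marks candidates for a log transform
skews <- sapply(numeric_vars, function(n) e1071::skewness(ch_2015_0193[[n]], na.rm = TRUE))
sort(skews, decreasing = TRUE)
```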
```{r transforms}
varsToLogNames <- c('CHArea','CHMaxIntensity','CHVarIntensity','CHPerimeter','CorrectAlongLat','CorrectAlongLon','RightSideCHMaxInt','AboveCHMaxInt','BelowCHMaxInt','RightSideCHMeanInt','BelowCHMeanInt')
for(name in varsToLogNames) {
  par(mfrow=c(1,2))
  ch_2015_0193[[paste0('log_',name)]] <- log(ch_2015_0193[[name]])
  numeric_vars <- c(numeric_vars,paste0('log_',name))
  plot(density(ch_2015_0193[[name]],na.rm=T),main ='Before',ylab=paste('Density of',name))
  plot(density(ch_2015_0193[[paste0('log_',name)]],na.rm=T),main ='After',ylab=paste0('Density of log_',name))
}
ch_2015_0193 <- ch_2015_0193[,!(names(ch_2015_0193) %in% varsToLogNames)]
numeric_vars <- numeric_vars[!(numeric_vars %in% varsToLogNames)]
```
## Exploring time series
### Subsetting the training dataset for visual exploration
For basic visual time-series exploration we will investigate the first two months of data.
```{r shrinking}
exploratoryOffsetThreshold <- 1440 # hours (the first two months)
ch_2015_0193_short <- ch_2015_0193[ch_2015_0193$ts <= exploratoryOffsetThreshold,]
```
### Noisy variables
Some of the variables are consistent in time, while others contain a lot of noise.
We consider the following variables too noisy and will not use them as features for now.
```{r noisy_plots}
consistent_variables <- c('ImageMeanValue','SunMeanIntensity','SunVarIntensity','Total_Nubmers_of_CH','Total_Area_CH','Relative_CH_Area','Total_CorrectSphereArea_CH','Relative_CH_CorrectSphereArea')
noisy_vars <- numeric_vars[!(numeric_vars %in% consistent_variables)]
for(name in noisy_vars) {
  print(
    ggplot(ch_2015_0193_short, aes_string('ts', name)) +
      geom_point(shape=1)
  )
}
```
### Consistent variables
We transform the consistent variables to reduce noise: spikes are removed with a running median, assuming they are caused by inconsistencies in the computer vision analysis of the Sun images.
```{r runmed}
for(name in consistent_variables) {
  variable <- ch_2015_0193_short[[name]]
  variable <- runmed(variable, k=25) # running median over a 25-observation window
  ch_2015_0193_short[[paste0(name,'_m')]] <- variable
}
```
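The window width k=25 is a fixed choice here; a quick way to sanity-check it is to overlay several window widths on a single variable (a sketch, not evaluated here):
```{r runmed_window_check, eval=FALSE}
name <- 'SunMeanIntensity'
plot(ch_2015_0193_short$ts, ch_2015_0193_short[[name]],
     pch = 1, cex = 0.5, xlab = 'ts (hours)', ylab = name)
for (k in c(5, 25, 101)) { # runmed requires an odd window width
  lines(ch_2015_0193_short$ts, runmed(ch_2015_0193_short[[name]], k = k))
}
```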
## Consistent variable correlations
```{r correlations2}
consistent_variables_approx <- paste0(consistent_variables, '_m')
corM <- cor(ch_2015_0193_short[,names(ch_2015_0193_short) %in% consistent_variables_approx],use = 'complete.obs')
corrplot(corM,method = 'pie',order = 'hclust')
```
As *Total_Area_CH_m* is fully correlated with *Relative_CH_Area_m*, and *Total_CorrectSphereArea_CH_m* with *Relative_CH_CorrectSphereArea_m*, we will keep only the *Relative* variables.
```{r relative_consistent}
consistent_variables <- consistent_variables[!(consistent_variables %in% c('Total_Area_CH','Total_CorrectSphereArea_CH'))]
```
Finally, we have **`r length(consistent_variables)`** predictors that can be used for model fitting (the time series are visualized later).
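For reference, the remaining predictor names can be listed (not evaluated here):
```{r final_predictors, eval=FALSE}
# the predictor set the rest of the analysis relies on
consistent_variables
```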
# Conclusion
## Processing the other spectrum's data
We declare a function that performs all of the actions applied to the initial data set.
```{r preprocessing}
chPreprocess <- function (data) {
  data2 <- data
  # creating the time axis
  data2$dt_record <- strptime(data2$dt_record,"%Y-%m-%dT%H:%M:%S",tz='UTC')
  data2$ts_sec <- as.numeric(data2$dt_record)
  data2$ts <- (data2$ts_sec - base_ts_sec)/3600 # hours since base time
  data2 <- data2[,names(data2) %in% c('ts',consistent_variables)]
  # filtering out missing values
  data2 <- na.omit(data2)
  # applying a running median to approximate the signal and remove spikes
  result <- data.frame(ts=data2$ts)
  for(name in consistent_variables) {
    variable <- data2[[name]]
    result[[name]] <- variable
    variable_m <- runmed(variable, k=25)
    result[[paste0(name,'_m')]] <- variable_m
  }
  return (result)
}
```
Applying this function to the data derived from the red images.
```{r 211_processing}
ch_2015_0211 <- read.csv('SampleData/ch_2015_0211.dsv',
                         sep = '\t',
                         dec = ',',
                         colClasses = c("dt_record" = "character"))
ch_2015_0211 <- chPreprocess(ch_2015_0211)
ch_2015_0211_short <- ch_2015_0211[ch_2015_0211$ts <= exploratoryOffsetThreshold,]
```
We average the properties of the two spectra: cubic smoothing splines are fitted through the smoothed variables of each spectrum, and the average of the two splines is computed on a common hourly grid. The smoothing splines both interpolate the two series to shared knots and further suppress residual noise.
```{r joining_predictors}
chJoin <- function(data193,data211) {
  left <- min(data193$ts,data211$ts)
  right <- max(data193$ts,data211$ts)
  knots <- seq(from = left, to = right, by = 1) # common hourly grid
  results <- data.frame(ts=knots)
  for(name in consistent_variables) {
    name_m <- paste0(name,'_m')
    sp1 <- smooth.spline(data193$ts,data193[[name_m]])
    sp2 <- smooth.spline(data211$ts,data211[[name_m]])
    synth_1 <- predict(sp1, knots)$y
    synth_2 <- predict(sp2, knots)$y
    synth_avg <- (synth_1+synth_2)*0.5
    results[[paste0(name,'_spline_193')]] <- synth_1
    results[[paste0(name,'_spline_211')]] <- synth_2
    results[[paste0(name,'_j')]] <- synth_avg
  }
  return(results)
}
joint_short <- chJoin(ch_2015_0193_short,ch_2015_0211_short)
```
## Exploring results
The resulting variables are plotted below.
```{r variation in time}
varsToAvg <- c('Total_Nubmers_of_CH','Relative_CH_Area','Relative_CH_CorrectSphereArea')
color_scale <- c('col_193_apx'='red','col_211_apx'='green','col_avg'='blue')
color_scale_labels <- c('col_193_apx'='approximated (193nm)','col_211_apx'='approximated (211nm)','col_avg'='average')
shape_scale <- c('shape_193_raw'=1,'shape_211_raw'=3)
shape_scale_labels <- c('shape_193_raw'='raw data (193nm)','shape_211_raw'='raw data (211nm)')
for(name in consistent_variables) {
  p <- ggplot(ch_2015_0193_short) +
    scale_colour_manual(values=color_scale,labels=color_scale_labels)+
    scale_shape_manual(values=shape_scale,labels=shape_scale_labels)+
    geom_point(aes_string('ts',name,shape='"shape_193_raw"'))+
    geom_point(data=ch_2015_0211_short,aes_string('ts',name,shape='"shape_211_raw"'))+
    geom_line(aes_string('ts',paste0(name,'_m'),colour='"col_193_apx"'),lwd=1)+
    geom_line(data=ch_2015_0211_short,aes_string('ts',paste0(name,'_m'),colour='"col_211_apx"'),lwd=0.5)
  if(name %in% varsToAvg) {
    p <- p + geom_line(data=joint_short,aes_string('ts',paste0(name,'_j'),colour='"col_avg"'),lwd=1)
  }
  print(p)
}
```
## Saving cleaned, transformed data
Processing and visualizing the whole-year data (reloading the 193 data set first).
```{r outputing results}
ch_2015_0193 <- read.csv('SampleData/ch_2015_0193.dsv',
                         sep = '\t',
                         dec = ',',
                         colClasses = c("dt_record" = "character"))
ch_2015_0193 <- chPreprocess(ch_2015_0193)
joint <- chJoin(ch_2015_0193,ch_2015_0211)
summary(joint)
nrow(joint)
for(name in consistent_variables) {
  p <- ggplot(ch_2015_0193) +
    scale_colour_manual(values=color_scale,labels=color_scale_labels)+
    scale_shape_manual(values=shape_scale,labels=shape_scale_labels)+
    geom_point(aes_string('ts',name,shape='"shape_193_raw"'))+
    geom_point(data=ch_2015_0211,aes_string('ts',name,shape='"shape_211_raw"'))+
    geom_line(aes_string('ts',paste0(name,'_m'),colour='"col_193_apx"'),lwd=1)+
    geom_line(data=ch_2015_0211,aes_string('ts',paste0(name,'_m'),colour='"col_211_apx"'),lwd=0.5)
  if(name %in% varsToAvg) {
    p <- p + geom_line(data=joint,aes_string('ts',paste0(name,'_j'),colour='"col_avg"'),lwd=1)
  }
  print(p)
}
```
Exporting the whole-year data.
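Note that `write.csv` fails if the output directory does not exist; it can be created first if needed (not evaluated here, assuming the working directory is the repository root):
```{r ensure_output_dir, eval=FALSE}
if (!dir.exists('ResultData')) dir.create('ResultData')
```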
```{r export}
write.csv(file='ResultData/CH_features_0211_cleaned_2015.csv', row.names = F,
          ch_2015_0211)
write.csv(file='ResultData/CH_features_0193_cleaned_2015.csv', row.names = F,
          ch_2015_0193)
write.csv(file='ResultData/CH_features_cleaned_2015.csv', row.names = F,
          data.frame(
            ts=joint$ts,
            Relative_CH_CorrectSphereArea_j=joint$Relative_CH_CorrectSphereArea_j,
            Relative_CH_CorrectSphereArea_spline_211=joint$Relative_CH_CorrectSphereArea_spline_211
          ))
```
This report can be generated by processing the Rmd file published at <https://github.com/dgrechka/SolarWindVelocity>.
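For example, from the repository root (assuming the `rmarkdown` package is installed):
```{r render_report, eval=FALSE}
rmarkdown::render('CH_features_extraction.Rmd')
```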
# Acknowledgments
I want to thank the Skobeltsyn Institute of Nuclear Physics of Moscow State University <http://swx.sinp.msu.ru>, which publishes space weather observation data, and especially Lucy Mukhametdinova from SINP MSU, who provided the data in a convenient CSV form.