P8105 Homework 1

Jessica Lavery 9/15/2019

Problem 1

# load the tidyverse (provides tibble(), pull(), and ggplot())
library(tidyverse)

# create data frame per specifications in HW1
df <- tibble(
  norm_samp = rnorm(8),
  sample_ge0 = (norm_samp >= 0), 
  character_vector = c("one", "two", "three", "four", "five", "six", "seven", "eight"),
  factor_vector = factor(x = c("a", "b", "c", "a", "a", "b", "c", "c"), levels = c("a", "b", "c"))
)

From the above data frame, we can try to take the mean of each type of variable (numeric, logical, character and factor). The mean cannot be computed for the character or factor variables: both return NA with a warning, as shown below.

# try to take the mean of each variable
mean(pull(df, norm_samp))
## [1] 0.3220115
mean(pull(df, sample_ge0))
## [1] 0.625
mean(pull(df, character_vector)) 
## Warning in mean.default(pull(df, character_vector)): argument is not
## numeric or logical: returning NA

## [1] NA
mean(pull(df, factor_vector)) 
## Warning in mean.default(pull(df, factor_vector)): argument is not numeric
## or logical: returning NA

## [1] NA

Alternatively, we can coerce each non-numeric variable to numeric with as.numeric and then try to compute the mean (output not shown). The logical variable is converted to a numeric variable with 0’s (FALSE) and 1’s (TRUE), and the mean can be computed; as.numeric makes explicit the conversion that mean appears to have performed automatically on the logical vector above. The character variable cannot be converted: as.numeric returns NA for every element with a coercion warning, so the mean is NA. The factor variable is converted to its integer level codes (1, 2, 3 for a, b, c), so a mean is returned, but it describes the codes rather than a meaningful quantity.

#coerce the logical variable to numeric and try to take the mean
#(as.numeric wraps pull so that the pulled vector itself is coerced)
mean(as.numeric(pull(df, sample_ge0)))

#coerce the character variable to numeric and try to take the mean
#(every element becomes NA, with a coercion warning)
mean(as.numeric(pull(df, character_vector)))

#coerce the factor variable to numeric and try to take the mean
#(the mean is taken over the integer level codes)
mean(as.numeric(pull(df, factor_vector)))
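As a minimal base-R sketch of how as.numeric behaves on each vector type, the example below applies the coercion directly to small stand-in vectors before taking the mean. The names lgl, chr and fct are illustrative and are not columns of df.

```r
# Stand-in vectors (hypothetical names; not the columns of df)
lgl <- c(TRUE, FALSE, TRUE, TRUE)
chr <- c("one", "two", "three", "four")
fct <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))

# Logical: TRUE/FALSE become 1/0, so the mean is the proportion of TRUEs
lgl_mean <- mean(as.numeric(lgl))

# Character: words cannot be parsed as numbers, so every element becomes NA
# (suppressWarnings() hides the "NAs introduced by coercion" warning)
chr_num <- suppressWarnings(as.numeric(chr))

# Factor: as.numeric() returns the integer level codes, not the labels
fct_num <- as.numeric(fct)

print(lgl_mean)  # 0.75
print(chr_num)   # NA NA NA NA
print(fct_num)   # 1 2 3 1
```

Because a factor coerces silently to its codes, a mean computed this way runs without complaint even though it rarely answers a useful question.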

We can also convert the logical vector and try to perform arithmetic with it. When the logical vector is converted to a numeric vector (0/1), we are able to multiply it by the random sample. When the logical vector is converted to a factor, we cannot multiply it by the random sample: '*' is not meaningful for factors, so NA is returned with a warning. However, converting the logical vector to a factor and then to a numeric vector does allow the multiplication, because the factor is replaced by its integer level codes (1 for FALSE, 2 for TRUE). As a result, the negative sample values are returned unchanged and the non-negative values are doubled.

#convert the logical vector to numeric and multiply the random sample by the result.
as.numeric(pull(df,sample_ge0)) * pull(df, norm_samp)
## [1] 0.0000000 1.0158786 0.0000000 1.5339783 0.3801067 0.5897175 0.0000000
## [8] 1.5308602
#convert the logical vector to a factor, and multiply the random sample by the result.
as.factor(pull(df,sample_ge0)) * pull(df, norm_samp)
## Warning in Ops.factor(as.factor(pull(df, sample_ge0)), pull(df,
## norm_samp)): '*' not meaningful for factors

## [1] NA NA NA NA NA NA NA NA
#convert the logical vector to a factor and then convert the result to numeric, and multiply the random sample by the result
as.numeric(as.factor(pull(df,sample_ge0))) * pull(df, norm_samp)
## [1] -0.5426848  2.0317572 -0.9507415  3.0679566  0.7602133  1.1794350
## [7] -0.9810228  3.0617204
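In the last result above, the non-negative values are twice those from the first multiplication, and the negative values reappear instead of being zeroed out. This follows from how factors store their levels. A small base-R sketch, using an illustrative vector named flag (not a column of df):

```r
flag <- c(FALSE, TRUE, FALSE, TRUE)  # hypothetical stand-in for sample_ge0

# Direct coercion keeps the 0/1 meaning
direct <- as.numeric(flag)

# Going through a factor returns level codes: the levels sort as
# "FALSE", "TRUE", so FALSE maps to 1 and TRUE maps to 2
via_factor <- as.numeric(as.factor(flag))

# To recover 0/1 from a factor, convert back through the labels
recovered <- as.numeric(as.logical(as.character(as.factor(flag))))

print(direct)      # 0 1 0 1
print(via_factor)  # 1 2 1 2
print(recovered)   # 0 1 0 1
```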

Problem 2

df2 <- tibble(
  x = rnorm(500),
  y = rnorm(500),
  x_plus_y_gt1_logical = (x + y > 1),
  x_plus_y_gt1_numeric = as.numeric(x_plus_y_gt1_logical),
  x_plus_y_gt1_factor = as.factor(x_plus_y_gt1_logical)
)

The dataset includes 5 variables with 500 observations. Variable x has mean 0.002953 and standard deviation 0.982574. From the random samples, 20.2% of instances have x + y > 1.
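The summary values quoted above could be computed along the following lines. This is a sketch in base R on freshly simulated vectors, not the report's own inline code; the seed is an assumption added for reproducibility, so the numbers will differ from those reported above.

```r
set.seed(1)  # assumption: the original report does not show a seed

x <- rnorm(500)
y <- rnorm(500)
gt1 <- (x + y) > 1

n_obs  <- length(x)  # number of observations
x_mean <- mean(x)    # sample mean of x
x_sd   <- sd(x)      # sample standard deviation of x
p_gt1  <- mean(gt1)  # proportion of observations with x + y > 1

cat(sprintf("n = %d, mean = %.6f, sd = %.6f, %.1f%% with x + y > 1\n",
            n_obs, x_mean, x_sd, 100 * p_gt1))
```

Taking the mean of the logical vector gt1 gives the proportion directly, since TRUE counts as 1 and FALSE as 0.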

Figure 1

ggplot(data = df2, aes(x = x, y = y, color = x_plus_y_gt1_logical)) +
  geom_point()

#export to project directory
ggsave(filename = "hw1_figure1_logical_color_scale.png")
## Saving 7 x 5 in image

Using a logical variable to create the color scale results in two distinct colors, one for each logical value (TRUE/FALSE).

Figure 2

ggplot(data = df2, aes(x = x, y = y, color = x_plus_y_gt1_numeric)) +
  geom_point()

Using a numeric variable to color the scatterplot produces a continuous color gradient spanning the range of the variable. Only the two extreme colors appear on the plot itself, because the variable takes only the values 0 and 1 and never a value in between.

Figure 3

ggplot(data = df2, aes(x = x, y = y, color = x_plus_y_gt1_factor)) +
  geom_point()

Using a factor variable to determine the color of the points on the scatterplot is similar to using the logical variable to indicate the color. There is one color per level of the factor (TRUE, FALSE).
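If a discrete legend is wanted without storing a separate factor column, one option is to wrap the numeric indicator in factor() inside aes(). The sketch below assumes ggplot2 is installed; df_demo, ind and p_discrete are illustrative names, and the seed is added only to make the example reproducible.

```r
library(ggplot2)

set.seed(1)  # assumption: seed added only for reproducibility

# Small stand-in data frame (hypothetical; mirrors the structure of df2)
df_demo <- data.frame(x = rnorm(50), y = rnorm(50))
df_demo$ind <- as.numeric(df_demo$x + df_demo$y > 1)

# factor() converts the 0/1 indicator on the fly, so ggplot2 uses a
# discrete color scale (one color per level) instead of a gradient
p_discrete <- ggplot(df_demo, aes(x = x, y = y, color = factor(ind))) +
  geom_point() +
  labs(color = "x + y > 1")
```

Printing p_discrete draws the plot; dropping factor() from the aes() call switches the legend back to a continuous gradient.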