Question 1: Given some sample data, write a program to answer the following:

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

First, we need to ask ourselves why the Average Order Value was previously calculated incorrectly. By dividing the sum of the “order_amount” by the number of orders present across all 100 shops, we end up with a mean of $3145.13. However, the calculation is abnormally large for a relatively affordable item. This could be the result of extreme outliers that are skewing the data or potentially missing data points.

To better evaluate the order value, let's begin by importing the sneaker sample dataset and ensuring there are no missing data points.

data <- read_excel("~/Downloads/2019 Winter Data Science Intern Challenge Data Set.xlsx")
glimpse(data)

## Rows: 5,000
## Columns: 7
## $ order_id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ shop_id        <dbl> 53, 92, 44, 18, 18, 58, 87, 22, 64, 52, 66, 40, 54, 100…
## $ user_id        <dbl> 746, 925, 861, 935, 883, 882, 915, 761, 914, 788, 848, …
## $ order_amount   <dbl> 224, 90, 144, 156, 156, 138, 149, 292, 266, 146, 322, 3…
## $ total_items    <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3, 2000, 1, 1…
## $ payment_method <chr> "cash", "cash", "cash", "credit_card", "credit_card", "…
## $ created_at     <dttm> 2017-03-13 12:36:56, 2017-03-03 17:38:51, 2017-03-14 0…

sum(is.na(data))

## [1] 0

Since no missing data points are evident, we can proceed by graphing the data on a boxplot to see how the number of orders and its respective "order_amount" is scattered.

ggplot(data=data, aes(x=shop_id,y=order_amount, group = 1)) +
  geom_boxplot(outlier.colour="black", outlier.shape=16, outlier.size=2, notch=FALSE) +
  labs(title = "Boxplot", subtitle = "without outliers removed", x = " ", y = "Order Amount") +
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())

Reference for removing x-axis ticks

As the graph suggests, there appears to be a few transactions with an abnormaly large "order_amount" and most like would be responsible for skewing the data towards a higher mean. If these outliers were removed, would the boxplot suggest a lower average?

ggplot(data=data, aes(x=shop_id,y=order_amount, group = 1)) +
  geom_boxplot(outlier.colour="black", outlier.shape=16, outlier.size=2, notch=FALSE) +
  coord_cartesian(ylim = quantile(data$order_amount, c(0.10, 0.90))) + 
  labs(title = "Boxplot", subtitle = "with outliers removed", x = " ", y = "Order Amount") +
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())

Reference for removing outliers from ggplot boxplot

When we remove the outliers from the graph, it's no wonder why the previously calculated mean of $3145.13 was so high. The boxplot also suggests that the average could be slightly lower than $300, a dramatic difference when compared to the initial calculation. The following code chunk will allow us to see the AOV for each shop in descending order to better discern what shops are skewing the data.

a <- aggregate(order_amount ~ shop_id, data, mean)
a[order(-a$order_amount),]

##     shop_id order_amount
## 42       42  235101.4902
## 78       78   49213.0435
## 50       50     403.5455
## 90       90     403.2245
## 38       38     390.8571
## 81       81     384.0000
## 6         6     383.5085
## 89       89     379.1475
## 33       33     376.2750
## 51       51     361.8043
...         ...    ...

On further examination, both shops 42 and 78 have a large AOV in comparison to the other 98 shops. Because of the presence of outliers, a better way to evaluate this data would be to find the median of all 5000 orders.

data_aov <- data %>%
  group_by(order_id) %>%
  summarize(aov = sum(order_amount)/(length(order_id)))

median(data_aov$aov)

## [1] 284

Ultimately, the median order value of the sneaker sample data is $284.

What metric would you report for this dataset?

Why the median? The median is the preferred measure of central tendency when there are extreme points in the data¹. If we were to use the mean, it would be heavily influenced by the extreme outliers in shops 42 and 78. By using the median as the metric, we are able to discern the “typical” value without having the scores of shops 42 and 78 skewing the value.

What is its value?

The median order value is $284, a more reasonable calculation for a relatively affordable item.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

q1.md

q1.md

Question 1: Given some sample data, write a program to answer the following:

Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

What metric would you report for this dataset?

What is its value?

Files

q1.md

Latest commit

History

q1.md

File metadata and controls

Question 1: Given some sample data, write a program to answer the following:

Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

What metric would you report for this dataset?

What is its value?