The American automotive industry is one of the largest and most competitive markets globally, and General Motors Company has been among its leading players for decades. This project performs a time series analysis of General Motors' monthly sales in the US market over the past two decades. Time series analysis allows us to uncover patterns, trends, and fluctuations in sequential data, to understand temporal relationships, and to make predictions based on historical patterns, facilitating informed decision-making and strategic planning. This approach is particularly valuable for forecasting future trends, identifying anomalies, and extracting meaningful insights from time-ordered datasets.
The main objectives of this project are:
- To analyze the sales performance of General Motors in the US market over the last two decades.
- To identify trends, patterns, and seasonality in the company's sales data.
- To recognize factors that may have contributed to increases or decreases in sales performance.
- To fit a model to the dataset and make predictions for forthcoming data.
The data used for this project is the monthly sales data for the General Motors Company in the US market.
To evaluate our model's predictions against actual values, we temporarily hold out the most recent 24 months and add them back at the forecasting stage.
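As a sketch of this holdout step in R (assuming the monthly sales have been read into a data frame named gm_data with a sales column, and an illustrative start date of January 2004; the actual start depends on the data file):

# Build the full monthly time series (start date assumed for illustration)
gm_full <- ts(gm_data$sales, start = c(2004, 1), frequency = 12)
n <- length(gm_full)
# Keep everything except the most recent 24 months for modeling
gm_ts <- window(gm_full, end = time(gm_full)[n - 24])
# Hold out the last 24 months for evaluating the forecasts later
gm_test <- window(gm_full, start = time(gm_full)[n - 23])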
The plot (fig. 2) shows a change in both trend and seasonality, so the sales time series does not appear to be stationary. A stationary time series is one whose statistical properties, such as the mean and variance, do not change over time, while a non-stationary time series has statistical properties that are time-dependent.
Furthermore, we can proceed with the ADF (Augmented Dickey-Fuller) test, whose null hypothesis is that the series has a unit root, i.e. is non-stationary.
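A minimal way to run this test in R uses adf.test from the tseries package (assuming the series is stored in gm_ts, as in the model-selection code below):

library(tseries)
# Null hypothesis: the series has a unit root (i.e., it is non-stationary)
adf.test(gm_ts)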
Clearly, from the ADF test result, the null hypothesis cannot be rejected, so the series is indeed non-stationary.
Plotting the ACF and PACF graphs can also help us understand the time series better. In time series analysis, the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots are often used to determine the orders of the autoregressive (AR) and moving average (MA) components.
The ACF plot displays the correlation between a time series and its lagged values. It can be used to identify the order of an MA model by observing the number of lags with significant autocorrelation.
The PACF plot displays the correlation between a time series and its lagged values after removing the effect of the intermediate lags. It can be used to identify the order of an AR model by observing the number of lags with significant partial autocorrelation.
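A sketch of these plots with base R functions (the lag.max value is our choice, picked to show several seasonal cycles):

# ACF: helps identify the MA order
acf(gm_ts, lag.max = 48, main = "ACF of GM monthly sales")
# PACF: helps identify the AR order
pacf(gm_ts, lag.max = 48, main = "PACF of GM monthly sales")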
The next step is taking the logarithm of the time series, a common transformation in time series analysis. In our dataset, the variability of the series grows over time, which makes accurate forecasting difficult. Taking the logarithm helps stabilize the variance by compressing the range of large observations and expanding the range of small ones.
Transformations such as the logarithm help stabilize the variance of a time series. To stabilize the mean, differencing removes changes in the level of the series, thereby eliminating (or reducing) trend and seasonality.
The differenced series is the change between consecutive observations in the original series and can be written as
$$y'_t = y_t - y_{t-1},$$
where $y_t$ denotes the observation at time $t$.
At times, the differenced data might not appear stationary, and a second differencing may be required to achieve a stationary series:
$$y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}.$$
In our case, that step is not necessary.
Furthermore, for a series with a seasonal pattern, a seasonal difference, the change between an observation and the corresponding observation from the previous season, can be applied:
$$y'_t = y_t - y_{t-m},$$
where $m$ is the seasonal period (here $m = 12$ for monthly data).
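A sketch of these transformations in R (the names gm_log, gm_diff, and gm_diff12 are illustrative):

# Log transform to stabilize the variance
gm_log <- log(gm_ts)
# First difference to remove the trend
gm_diff <- diff(gm_log)
# Additional seasonal difference at lag 12 to remove the yearly pattern
gm_diff12 <- diff(gm_diff, lag = 12)
plot(gm_diff12)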
Seasonality in time series refers to the pattern of regular and predictable fluctuations that occur over fixed intervals of time, such as days, weeks, months, or years.
In our case, the PACF shows a spike at lag 12, which may suggest seasonality, and the ACF shows strong correlations at lags 12, 24, and 36 (fig. 3 and fig. 6).
If a time series is seasonal at lag = 12, there is a repeating pattern or regular fluctuation in the data that occurs every 12 time units. In the context of monthly data, this indicates yearly seasonality: the same or similar pattern tends to repeat every 12 months. Understanding seasonality at lag = 12 allows us to account for these recurring patterns when developing forecasting models.
Let's look at a seasonal sales boxplot to better understand the seasonality.
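One way to draw this boxplot in R is to group the observations by calendar month with cycle():

# One box per calendar month, across all years in the sample
boxplot(gm_ts ~ cycle(gm_ts), names = month.abb,
        xlab = "Month", ylab = "Monthly sales")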
This graph visualizes the recurring seasonal fluctuations. For example, sales appear low during March, increase in April, decrease slightly until August, and then pick up again, and so on.
If we apply a first difference together with a seasonal difference at lag 12 to the log-transformed series, the result should be stationary.
We run the ADF test again to make sure that now we have a stationary series.
From the test, the null hypothesis of a unit root is rejected, confirming that the differenced series is now stationary.
Clearly, the time series can be described by a SARIMA$(p,d,q)\times(P,D,Q)_s$ model, where:
- $p$ and seasonal $P$ indicate the autoregressive order.
- $d$ and seasonal $D$ indicate the differencing that must be done to stationarize the series.
- $q$ and seasonal $Q$ indicate the moving average order.
By definition, if $d$ and $D$ are nonnegative integers, then $\{X_t\}$ is a SARIMA$(p,d,q)\times(P,D,Q)_s$ process if the differenced series $Y_t = (1-B)^d (1-B^s)^D X_t$ is a causal ARMA process satisfying $\phi(B)\Phi(B^s)Y_t = \theta(B)\Theta(B^s)Z_t$, where $B$ is the backshift operator, $\{Z_t\}$ is white noise, and the component polynomials are
- AR: $\phi(z) = 1 - \phi_1 z - \ldots - \phi_p z^p$
- MA: $\theta(z) = 1 + \theta_1 z + \ldots + \theta_q z^q$
and the seasonal components are
- seasonal AR: $\Phi(z) = 1 - \Phi_1 z - \ldots - \Phi_P z^P$
- seasonal MA: $\Theta(z) = 1 + \Theta_1 z + \ldots + \Theta_Q z^Q$.
For our time series, $d = 1$, $D = 1$, and $s = 12$, as established above.
From the ACF and PACF plots for the sales time series, we notice significant spikes on the ACF plot before the first seasonal lag, pointing to a non-seasonal MA component, and a significant spike at the seasonal lag, pointing to a seasonal MA component. However, basing the analysis only on the ACF and PACF plots may lead to uncertain and insufficient conclusions.
To determine the best combination of the remaining orders $(p, q, P, Q)$, we run a grid search over candidate models and compare them using the AIC:
# Load the necessary packages
library(forecast)
library(ggplot2)

# Choosing the best model for the GM time series
# Set values for d, D, and s
d <- 1
D <- 1
s <- 12

# Initialize variables to store the minimum AIC and corresponding parameters
min_aic_gm <- Inf
best_params_gm <- c()

# Loop over values of p, q, P, and Q
for (p in 1:3) {
  for (q in 1:3) {
    for (P in 1:3) {
      for (Q in 1:3) {
        # Check if the sum of parameters is less than or equal to 10
        if (p + d + q + P + D + Q <= 10) {
          # Fit the SARIMA model on the log-transformed series
          model_gm <- try(arima(log(gm_ts), order = c(p-1, d, q-1),
                                seasonal = list(order = c(P-1, D, Q-1), period = s)),
                          silent = TRUE)
          # Skip parameter combinations for which the fit fails
          if (inherits(model_gm, "try-error")) {
            next
          }
          # Perform the Ljung-Box test on the residuals of the fitted model
          test <- Box.test(model_gm$residuals, lag = log(length(model_gm$residuals)))
          # Calculate the SSE of the fitted model
          sse <- sum(model_gm$residuals^2)
          # Update the minimum AIC and corresponding parameters if a better model is found
          if (model_gm$aic < min_aic_gm) {
            min_aic_gm <- model_gm$aic
            best_params_gm <- c(p-1, d, q-1, P-1, D, Q-1)
          }
          # Print the AIC, SSE, and p-value of the fitted model
          cat(p-1, d, q-1, P-1, D, Q-1, "AIC:", model_gm$aic, "SSE:", sse,
              "p-value:", test$p.value, "\n")
        }
      }
    }
  }
}

# Print the best parameters and minimum AIC
cat("Best Parameters: p, d, q, P, D, Q =", best_params_gm, "\n")
cat("Minimum AIC =", min_aic_gm, "\n")
The output of the code includes the AIC, SSE, and p-value of each fitted model. The AIC (Akaike Information Criterion) is calculated as $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized value of the model's likelihood; lower values indicate a better trade-off between goodness of fit and model complexity.
The Ljung-Box test is a statistical test for autocorrelation in a time series. The test is used to assess whether the residuals (i.e., the differences between the observed values and the values predicted by a model) of a time series model are independently distributed. Therefore, the null hypothesis of the Ljung-Box test is that the residuals of the fitted model are independently distributed.
If the p-value of the Ljung-Box test is small (less than 0.05), this provides evidence against the null hypothesis; hence a small p-value suggests that the model may not be a good fit for the data.
Conversely, if the p-value is large (greater than 0.05), there is no evidence against the null hypothesis, the residuals can be treated as independently distributed, and the model is likely a good fit for the data.
From the results of the code above, we choose the best-fitting model for the time series to be SARIMA$(0,1,2)\times(0,1,1)_{12}$, which attains the lowest AIC.
For a SARIMA$(0,1,2)\times(0,1,1)_{12}$ model, the equation is
$$(1-B)(1-B^{12})X_t = (1 + \theta_1 B + \theta_2 B^2)(1 + \Theta_1 B^{12})Z_t,$$
where $B$ is the backshift operator and $\{Z_t\}$ is white noise. For the GM model, $X_t$ is the log-transformed sales series, and $\theta_1$, $\theta_2$, and $\Theta_1$ take the values estimated from the fitted model.
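Refitting the selected specification directly (final_model is an illustrative name; the coefficient estimates printed by this call are the ones that enter the equation above):

# Fit the chosen SARIMA(0,1,2)x(0,1,1)[12] model on the log scale
final_model <- arima(log(gm_ts), order = c(0, 1, 2),
                     seasonal = list(order = c(0, 1, 1), period = 12))
final_model  # prints the estimated theta and Theta coefficients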
As mentioned above, residuals in a time series model are what is left over after fitting the model; for most time series models, they equal the difference between the observations and the corresponding fitted values. To make sure that the model we have chosen is a good fit, we need to verify that the residuals are white noise.
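A sketch of these checks in base R (assuming final_model from the refit above):

res <- final_model$residuals
plot(res, main = "Residuals over time")      # should show no trend
hist(res, main = "Histogram of residuals")   # should look roughly normal
qqnorm(res); qqline(res)                     # points should follow the line
acf(res, main = "ACF of residuals")          # no significant spikes expected
Box.test(res, lag = 12, type = "Ljung-Box")  # formal test for autocorrelation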
From the residual analysis of the chosen model, we conclude the following:

- From the time series plot, the residuals show no trend, i.e. they are stationary.
- From the histogram, the residuals appear normally distributed.
- The Q-Q plot also supports normality.
- The ACF plot indicates white-noise residuals.
Thus, we can say that the chosen model fits the data well.
The last step is forecasting the monthly sales of General Motors Company for the next 24 months.
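A sketch of the forecasting step with the forecast package (gm_test is the 24-month holdout from the earlier split; since the model was fit on the log scale, the forecasts are back-transformed with exp):

# Forecast the next 24 months on the log scale
fc <- forecast(final_model, h = 24)
plot(fc)
# Back-transform the point forecasts and compare with the held-out data
accuracy(exp(fc$mean), gm_test)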
Market Analysis: The time series analysis provides valuable insights into the monthly sales performance of General Motors Company in the US automotive market over the past two decades.
At first, a fairly steady trend is evident in the sales data. Notably, there is a spike around 2016, attributable to several factors, including favorable conditions in the overall US automotive market and the continuing economic recovery following the global financial crisis of 2008. A notable decline follows in the 2019-2020 timeframe, driven by the impact of the COVID-19 pandemic. After this downturn, sales increase again in a more sustained pattern.
On model selection, the SARIMA$(0,1,2)\times(0,1,1)_{12}$ model was identified as the best fit for the time series. The selection process considered various combinations of parameters, and the chosen model demonstrated the lowest AIC, indicating a good balance between model complexity and goodness of fit.
The original time series exhibited non-stationarity, but after applying first-order differencing and seasonal differencing at lag 12, a stationary series was achieved. This transformation is essential for building accurate forecasting models.
The diagnostic analysis of the selected model's residuals indicates that the residuals exhibit characteristics of white noise, suggesting that the model adequately captures the underlying patterns in the data.
The forecasting section provides a glimpse into the future by predicting General Motors' monthly sales for the next 24 months. The forecasted values, along with prediction intervals, offer a range of potential outcomes and highlight the uncertainty associated with future predictions.
Comparing the forecasts with the 24 months of actual data that were initially held out, some variation is noticeable, but the prediction intervals encompass the original data. This implies that the prediction error is likely to fall within the specified interval.
- Brockwell, P. J., & Davis, R. A. (2016). Introduction to Time Series and Forecasting, Third Edition.
- Cryer, J. D., & Chan, K.-S. (2008). Time Series Analysis with Applications in R, Second Edition.
- Prabhakaran, S. (2021). ARIMA Model - Complete Guide to Time Series Forecasting in Python. https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
- Graves, A. (2020). Time Series Forecasting with a SARIMA Model. https://towardsdatascience.com/time-series-forecasting-with-a-sarima-model-db051b7ae459