ROSSMANN Sales Forecasting: An End-to-End Data Science Project Using Machine Learning for Sales Prediction
This project is a sales prediction using data from Rossmann, a German drugstore chain with more than 4,000 branches across seven European countries. The dataset is available on Kaggle.
Here we develop an end-to-end solution, starting with understanding the business demand, then moving through visualizations, data preparation, and modeling, and ending with the deployment of the model on the Heroku cloud service and a Telegram bot that presents the results to the stakeholders.
Note: If there is any problem opening the Jupyter notebook here on GitHub, please check it at this link.
This project comes from Meigarom Lopes's course Data Science em Produção. The course goes from business understanding to product deployment. It basically addresses the ten steps presented below in this README, with a strong focus on the business context and on problem solving for business value creation.
I'm very grateful to Alan Maehara, a teammate from the Data Science em Produção course, for his Sales Prediction project, which helped me a lot in building this presentation. Many elements of this README, like tables, structure, and charts, were inspired by his project. If you'd like to see a very well-structured, detailed project, you should take a look at his solution. Besides that, it's a great learning resource (he explains many statistical and data science concepts and techniques), and I'm sure you'll enjoy his project-building journey.
- A Brief Introduction to the ROSSMANN Company
- Project Methodology: CRISP-DM
- Phase 1: Business Understanding
- Phase 2: Data Understanding
- Phase 3: Data Preparation
- Phase 4: Modeling
- Phase 5: Evaluation
- Phase 6: Deployment
- Conclusion
This is based on the ROSSMANN company portrait available on the company website
Dirk Rossmann GmbH is one of the largest drugstore chains in Europe and the largest in Germany in 2020 (see list). The retail company, founded by Dirk Roßmann in 1972 in Germany, operates over 4,000 drugstores in 7 European countries. The company has been increasing the number of stores outside Germany over the last few years, as shown in this chart.
With more than 4,000 branches (2,196 of them in Germany), the company's operations extend to Albania, the Czech Republic, Hungary, Poland, and Turkey. The foreign subsidiaries contributed 30 percent of group sales.
The group started 2020 intending to open 200 new branches and to maintain its rate of expansion (an investment volume of 200 million euros).
ROSSMANN carries around 21,700 different items, with a focus on skin and body care, food and luxury foods, baby products, detergents, cleaning products, and hair care.
The method used to manage the project was the Cross-Industry Standard Process for Data Mining (CRISP-DM), one of the most widely used methodologies for Data Science projects.
As said by Wirth and Hipp, "the CRISP-DM reference model for data mining provides an overview of the life cycle of a data mining project". This process allows us to iterate over the steps and map potential problems in the project.
Aiming for more productivity and effectiveness, the Data Science project is broken into six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
CRISP-DM was chosen mainly because of four advantages:
- for each complete cycle, we have an end-to-end solution that can be implemented and meet part of the business demand;
- it delivers business results more quickly than many other methodologies;
- we can map many problems, anticipate impediments, and avoid them;
- it helps us not to spend too much time on a particular step.
In this project, the CRISP-DM phases are broken into ten steps: Business Understanding, Data Description, Feature Engineering, Filtering Variables, Exploratory Data Analysis, Data Preparation, Feature Selection, Machine Learning Modeling, Hyperparameter Fine Tuning, Translating and Interpreting the Error, and Model Deployment.
Merging the phases and the steps we have:
Correctly identify and understand the business demand objectives and the requirements from a business perspective. This understanding has to go deeper, identifying who the true stakeholder is and the reason behind the request.
- Step 0: Business Demand
It starts with data collection, followed by activities to get familiar with the data, identify data quality problems, discover initial insights, and/or detect interesting subsets to form hypotheses about hidden information.
- Step 1: Data Collection and Description
- Step 2: Hypothesis Creation and Feature Engineering
- Step 3: Filtering Variables and Rows
- Step 4: Exploratory Data Analysis
This phase aims to prepare the data for modeling.
- Step 5: Data Preparation (transforming, scaling, ...)
- Step 6: Feature Selection
Build Machine Learning models and compare them using cross-validation. After that, choose the best one and tune its parameters.
- Step 7: Machine Learning Modeling
- Step 8: Hyperparameter Fine Tuning
Evaluate the model results with appropriate metrics. Besides that, translate those metrics into business terms.
- Step 9: Translating and Interpreting the Error
Creating and evaluating the model is usually not the end of the project. The results have to be delivered or presented to the stakeholders, and that's what this phase is about.
- Step 10: Deploying the Machine Learning Model to Production: a telegram bot
After completing all these phases, we have a solution (even though it may not be the best one) that satisfies the initial business demand. We then evaluate the project outcome and the business needs and decide whether more cycles are necessary. If so, all the steps are done again in order to improve the model.
Before starting the data analysis and modeling, the first task is to understand the business problem we received as data scientists. That's important because there's a good chance that the request we received is not exactly what the stakeholder wants (and sometimes the person who made the request is not the true stakeholder; in that case, we also have to identify who the stakeholder really is).
For that goal, we need to seek to understand four things:
- The context behind the business request: how did it come about?
- The reason the person is making the request;
- Who the main stakeholder of the request is. If he/she is someone other than the managers, maybe the request is not exactly what we received; besides that, the main stakeholder can guide us during the project;
- The solution format:
- granularity: daily, weekly, by stores, by product, and so forth;
- What kind of machine learning problem it is: classification, regression, clustering, and so on;
- Main methods that could be used: time series, SVM, neural networks, and so on;
- presentation method: dashboard, mail, smartphone message, and so on.
Our Data Science team received a request from the store managers to forecast sales for their respective stores for the next six weeks. Before starting to handle the data, our team decided to better understand that business request. We found that the request actually came from the CFO, who had asked the managers for their stores' revenue predictions for the next six weeks because he wanted to renovate the stores and planned to base that investment on each store's expected sales over that period.
So, our team worked through the four points described above:
- The business request context: in a monthly meeting, the CFO asked the managers for the next six weeks' sales predictions for each store;
- The reason behind the request: the CFO wants to anticipate a portion of each store's revenue to invest in its renovation;
- Stakeholder: CFO
- The Solution format:
- Granularity: daily sales by stores;
- Kind of problem: Sales Forecast;
- Main methods: Regression, Time Series;
- Delivering Method: Real-time six weeks sales forecasting in a smartphone app.
After better understanding the business demand, our team could start handling the data. As said before, this phase involves loading, cleaning, applying descriptive statistics to, and exploring the data. It comprises steps one through four.
In this step we get a sense of how challenging the problem is.
In a real-life project, this step starts with requests to databases, APIs, and so on, in order to collect all the available information that could help us solve the problem. With that in hand, the team builds a final dataset. Since this project works with data from Kaggle, our first task is simply to download the CSV files and load them into the Jupyter notebook.
The data comes from the Rossmann Store Sales competition on Kaggle. There are three main datasets:
- Training Data: historical data for training the model (it includes the target: sales)
- Test Data: historical data for testing the model (it does not include the target `sales`)
- Store Data: supplemental information about the stores
- Number of Rows: 1,017,209
- Number of Columns: 18
- Date Range: from 2013-01-01 to 2015-07-31
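A minimal loading sketch, assuming the Kaggle CSV files were downloaded into a local `data/` folder (the folder name is an assumption):

```python
import pandas as pd

# raw Kaggle files: daily sales history and store-level attributes
df_sales = pd.read_csv('data/train.csv', low_memory=False)
df_store = pd.read_csv('data/store.csv', low_memory=False)

# one row per (store, day), enriched with the store-level attributes
df_raw = pd.merge(df_sales, df_store, how='left', on='Store')

print(df_raw.shape)                                 # (1017209, 18)
print(df_raw['Date'].min(), df_raw['Date'].max())   # 2013-01-01 / 2015-07-31
```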
edited table from this project
Variable | Description | Data Type |
---|---|---|
sales (target) | the turnover for any given day (this is what we are predicting) | numerical (continuous) |
store | Store ID (unique) | numerical (discrete) |
day_of_week | day of the week (1 = Monday, 7 = Sunday) | numerical (discrete) |
date | date of each sales entry | date |
customers | the number of customers on a given day | numerical (discrete) |
open | an indicator for whether the store was open: 0 = closed, 1 = open | numerical (dummy) |
promo | indicates whether a store is running a promo on that day | numerical (dummy) |
state_holiday | indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None | categorical (nominal) |
school_holiday | indicates if the (Store, Date) was affected by the closure of public schools: 1 = affected, 0 = not affected | numerical (dummy*) |
store_type | differentiates between 4 different store models: a, b, c, d | categorical (nominal) |
assortment | describes an assortment level: a = basic, b = extra, c = extended | categorical (ordinal) |
competition_distance | distance in meters to the nearest competitor store | numerical (continuous) |
competition_open_since_month | gives the approximate month of the time the nearest competitor was opened | numerical (discrete) |
competition_open_since_year | gives the approximate year of the time the nearest competitor was opened | numerical (discrete) |
promo2 | promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating | numerical (dummy) |
promo2_since_week | describes the calendar week when the store started participating in Promo2 | numerical (discrete) |
promo2_since_year | describes the year when the store started participating in Promo2 | numerical (discrete) |
promo_interval | describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store | categorical (nominal) |
Along this step, our team identified some missing values and some inappropriate data types. It's important to deal with them because many Machine Learning algorithms can't handle such problems.
First, we changed the `date` column type from object to pandas datetime using the `.astype()` function.
Second, we dealt with the missing data. We identified the following columns with missing values:
Variable | NANs |
---|---|
competition_distance | 2642 |
competition_open_since_month | 323348 |
competition_open_since_year | 323348 |
promo2_since_week | 508031 |
promo2_since_year | 508031 |
promo_interval | 508031 |
The first task after identifying those NaNs would be to try to discover why the values are missing. In a real-life context, we would have talked to the data engineering team to better understand the reason for the missing data; it could also be a mistake made when collecting it. Since in this fictional project there's no one who could give us that information, we cannot know exactly the reason for those NaNs. Thus, the next step is to handle them.
There are several techniques for dealing with missing values, such as imputing the column mean, dropping the rows or columns, or using Machine Learning algorithms to predict the values. Due to the high number of missing values, the traditional methods would not work well here: dropping rows or imputing the mean could cause a significant loss of information. So we chose to impute values based on business understanding, using other columns to derive a value. Since we are working in CRISP-DM cycles, we can handle this better in the next one.
The imputation was as follows:
- `competition_distance`: missing values probably indicate that there is no competitor nearby. So we imputed a value greater than the maximum distance in the column: since the maximum is 75,860 m, we imputed 200,000 m for the missing data;
- `competition_open_since_month` and `competition_open_since_year`: for both, we imputed the month and year extracted from the `date` column of the sales record;
- `promo2_since_week` and `promo2_since_year`: if these columns are missing, it's probably because the store is not participating in promo2. So we filled them with the week and year of the sales record's own date;
- `promo_interval`: we created the column `is_promo` (a dummy variable) to indicate whether a store was running a consecutive promotion on that day; `promo_interval` itself will be dropped in Step 3.
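A minimal sketch of these imputation rules, assuming the merged data is already in a DataFrame `df` with snake_case column names and a datetime `date` column:

```python
import pandas as pd

# competition_distance: NaN read as "no competitor nearby", so use a distance far
# beyond the observed maximum (75,860 m)
df['competition_distance'] = df['competition_distance'].fillna(200000.0)

# competition_open_since_month/year: fall back to the sale date's own month/year
df['competition_open_since_month'] = df['competition_open_since_month'].fillna(df['date'].dt.month)
df['competition_open_since_year'] = df['competition_open_since_year'].fillna(df['date'].dt.year)

# promo2_since_week/year: store not in promo2, fall back to the sale date itself
df['promo2_since_week'] = df['promo2_since_week'].fillna(df['date'].dt.isocalendar().week)
df['promo2_since_year'] = df['promo2_since_year'].fillna(df['date'].dt.year)

# promo_interval -> is_promo dummy: 1 if the sale month falls inside the promo interval
month_map = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
             7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
df['promo_interval'] = df['promo_interval'].fillna('0')
df['month_map'] = df['date'].dt.month.map(month_map)
df['is_promo'] = df.apply(
    lambda r: 0 if r['promo_interval'] == '0'
    else int(r['month_map'] in r['promo_interval'].split(',')), axis=1)
```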
To summarize the data, we used descriptive statistics. We divided the dataset into numerical and categorical variables and applied the corresponding techniques.
For numerical variables, we basically used central tendency (mean and median) and dispersion (standard deviation, minimum, maximum, range, first and third quartiles, skewness, and kurtosis) measures.
Two columns are highly skewed: `competition_distance` and `competition_open_since_year`. Besides that, the `competition_distance` kurtosis is very large, which can indicate the presence of outliers (remember that we added large values when filling the NaNs).
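A minimal sketch of this numerical summary, assuming `df` is the DataFrame prepared above:

```python
import pandas as pd

num_attributes = df.select_dtypes(include=['int64', 'float64'])

summary = pd.concat([
    num_attributes.mean(),
    num_attributes.median(),
    num_attributes.std(),
    num_attributes.min(),
    num_attributes.max(),
    num_attributes.max() - num_attributes.min(),  # range
    num_attributes.skew(),
    num_attributes.kurtosis(),
], axis=1)
summary.columns = ['mean', 'median', 'std', 'min', 'max', 'range', 'skew', 'kurtosis']
print(summary)
```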
Since there are not many descriptive techniques for categorical data, we used boxplots of sales for each of them.
Sales seem to differ across store types and assortment levels.
To better understand the relationship between the target variable (`sales`) and the features, we created some hypotheses based on the business problem to guide the feature engineering and then the exploratory data analysis.
The hypotheses created in this step will be validated in the Exploratory Data Analysis step. This gives us a good notion of the relationship between the features and the target and helps us choose more accurately which features to use when modeling.
To guide the hypothesis creation, the following mind map was built. The Hypothesis Mind Map has basically three elements:
- The phenomenon: it's what we want to measure or model (sales);
- Agents: entities that somehow impact the phenomenon (customers, stores);
- Agents' attributes: age, marital status, size, and so on.
-> Mind Map goal: derive a list of hypotheses that we can then prioritize.
After creating the hypothesis list, we prioritized some over others. The criterion used was the availability of the corresponding data.
Hypothesis Final List
1. Stores with a larger assortment should sell more.
2. Stores with closer competitors should sell less.
3. Stores with longer-established competitors should sell more.
4. Stores with active promotions for longer should sell more.
5. Stores with more promotion days should sell more.
6. Stores with more consecutive promotions should sell more.
7. Stores open during the Christmas holiday should sell more.
8. Stores should be selling more over the years.
9. Stores should sell more in the second half of the year.
10. Stores should sell more after the 10th of each month.
11. Stores should sell less on weekends.
12. Stores should sell less during school holidays.
First of all, why do we need this step, and why do it before the EDA?
- Since we created a list of hypotheses, some variables needed to validate them may not be available in the dataset yet. This step makes sure we'll have them for the exploratory data analysis;
- To avoid cluttering the EDA section with feature creation, maps, and tables; that is, to keep the code clean, with feature creation and exploratory analysis in separate sections.
We created the following features:
- from the `date` column we created `year`, `month`, `day`, `week_of_year`, and `year_week`, since we need them to validate some hypotheses;
- `competition_since`: how long, in months, the nearest competitor has existed, relative to the purchase date;
- `promo_since`: how long a promotion has been active.
Some categorical features had their classes renamed: the `state_holiday` and `assortment` values were single letters and became descriptions (e.g. 'a' in `state_holiday` became 'public_holiday'). Since `store_type` had no description in the data source, it wasn't changed.
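A minimal sketch of these derivations, assuming `df` holds the imputed data from the previous step (the exact formulas and category names are assumptions about how the features can be built):

```python
import pandas as pd

# calendar features derived from the sale date
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)
df['year_week'] = df['date'].dt.strftime('%Y-%W')

# competition_time_month: months since the nearest competitor opened
competition_since = pd.to_datetime(
    df['competition_open_since_year'].astype(int).astype(str) + '-' +
    df['competition_open_since_month'].astype(int).astype(str) + '-1')
df['competition_time_month'] = ((df['date'] - competition_since) / pd.Timedelta(days=30)).astype(int)

# promo_time_week: weeks since the store joined promo2
promo_since = pd.to_datetime(
    df['promo2_since_year'].astype(int).astype(str) + '-' +
    df['promo2_since_week'].astype(int).astype(str) + '-1', format='%Y-%W-%w')
df['promo_time_week'] = ((df['date'] - promo_since) / pd.Timedelta(weeks=1)).astype(int)

# rename letter codes to readable descriptions (label for '0' is an assumption)
df['state_holiday'] = df['state_holiday'].astype(str).map(
    {'a': 'public_holiday', 'b': 'easter_holiday', 'c': 'christmas', '0': 'regular_day'})
df['assortment'] = df['assortment'].map({'a': 'basic', 'b': 'extra', 'c': 'extended'})
```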
Based on business restrictions (e.g. a variable that won't be available in the future to use in the prediction), we have to properly filter variables.
This step is divided into two parts: filtering rows and filtering columns.
Filtering Rows: since closed stores obviously have no sales on that day, we dropped the rows where `open` equals zero. We also kept only the rows where `sales` is greater than zero.
Filtering Columns:
- `customers`: we can't use it because we won't have this data available for the next six weeks (the prediction period), unless we build another project to predict how many customers each store will have in that period;
- `promo_interval` and `month_map` were only used to create new columns and won't be needed anymore;
- since `open` has no further use after the row filtering, we dropped it too.
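A minimal sketch of this filtering, assuming the same DataFrame `df`:

```python
# keep only open stores with positive sales, then drop the unusable columns
rows_kept = (df['open'] != 0) & (df['sales'] > 0)
df = df.loc[rows_kept].drop(columns=['customers', 'open', 'promo_interval', 'month_map'])
```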
In this step we go deeper into the data to get some valuable business insights. Basically, we want to know which variables are the most important for the prediction and how strong that importance is.
The EDA is divided into three parts:
- Univariate Analysis: to get an overview of each variable individually, looking at its distribution and counting classes for categorical features;
- Bivariate Analysis: check the relationship between each feature and the target based on the hypothesis list. This is one of the most important steps: from it we can get business insights and decide whether a feature is really important to the model.
- Multivariate Analysis: check the relationship between the features and identify those highly correlated.
It's divided into Response, Numerical, and Categorical analysis.
The target distribution does not look normal. Since many Machine Learning algorithms work better when the target is close to normally distributed, we checked that using both a Q-Q plot and the Shapiro-Wilk normality test. As shown by the Q-Q plot below, the distribution does not seem to follow a normal one. The Shapiro-Wilk test (statistic = 0.902; p-value = 0.000, less than 0.05) confirms that it is probably not Gaussian. Therefore, we'll have to rescale it before fitting the model.
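A minimal sketch of this check, assuming `df['sales']` holds the target (scipy's Shapiro-Wilk test is meant for samples of at most a few thousand points, so a sample is tested):

```python
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

sample = df['sales'].sample(5000, random_state=42)

stat, p_value = stats.shapiro(sample)
print(f'Shapiro-Wilk: statistic={stat:.3f}, p-value={p_value:.3f}')

sm.qqplot(sample, line='s')  # Q-Q plot against a fitted normal distribution
plt.show()
```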
All variables seem to be non-normal. Some highlights:
- `day_of_week`: sales are lower on Sundays;
- `competition_distance`: there are more stores with competitors nearby than far away;
- `promo2_since_year`: more stores joined the consecutive promotion in 2013.
Some conclusions:
- `state_holiday`: there's a difference among the three types of holidays; based on that difference, the model could better adjust its predictions;
- `store_type`: sales behave differently for different store types. For instance, stores of type 'b' have less concentrated sales than the others;
- `assortment`: there are fewer sales for stores with the 'extra' assortment.
In this task, the hypotheses were validated one by one. As said before, what we basically did was:
- validate the hypothesis;
- conclude if the feature is important to use in the model;
- get some business experience.
H1: Stores with a larger assortment should sell more
This hypothesis is about the variable `assortment`. We don't have much information about this feature, only that it has three classes: basic, extended, and extra. The count for each one is as follows:
- basic: 444875
- extended: 391254
- extra: 8209
Since the class counts differ considerably, we compared averages rather than sums.
Conclusion: TRUE.
On average, sales seem to increase as the assortment gets larger. Also, analyzing over the weeks, we can see that 'extended' and 'basic' move quite similarly.
H2: Stores with closer competitors should sell less
This hypothesis is about the variable `competition_distance`. It is the distance in meters to the nearest competitor store.
Conclusion: FALSE.
Stores with closer competitors actually sell more. Competition distance and sales have a negative, non-linear correlation: as the distance increases, sales decrease. That correlation (-0.23) is strong enough to consider the variable relevant to the model.
H3: Stores with longer-established competitors should sell more
Here we analyzed `competition_time_month`.
Conclusion: FALSE.
The more recent the competition, the higher the sales. The feature is relevant to the model because its correlation with the target is not too close to zero.
H4: Stores with active promotions for longer should sell more
To validate this hypothesis we used the column `promo_time_week`. It measures how long, in weeks, a promotion has been active.
Conclusion: FALSE.
Stores with active promotions for longer sell less: sales start to decrease after a certain period. According to the correlation, there's no evidence of a strong relationship between this feature and the target.
H5: Stores with more promotion days should sell more
Our team decided to validate this hypothesis in the second CRISP cycle.
H6: Stores with more consecutive promotions should sell more
This hypothesis analyzes `promo` and `promo2` in terms of weeks of the year (`year_week`).
Conclusion: FALSE.
Stores with more consecutive promotions sell less. Since both levels move quite similarly, there's no evidence of a strong relationship between this feature and the target.
H7: Stores open during the Christmas holiday should sell more
Here we analyzed `state_holiday`.
Conclusion: FALSE.
On average, stores open during Christmas have one of the highest sales levels, but the Easter holiday has a higher mean. In fact, stores sell more during holidays than on regular days, so this feature can be considered important to the analysis.
H8: Stores should be selling more over the years
Here we analyzed the `year` column.
Conclusion: TRUE.
On average, sales are increasing over the years. Since the correlation is very high, this feature is important to the model.
H9: Stores should sell more in the second half of the year
Here we used `month`. Since 2015 is incomplete, we compared means instead of sums, because the missing data for the second half of that year could be misleading.
Conclusion: FALSE.
Stores sell less in the second half of the year. The feature and the target have a moderate negative correlation, so it can be considered important to the model.
H10: Stores should sell more after the 10th of each month
Here we used the `day` feature.
Conclusion: FALSE.
On average, there's no strong evidence that stores sell more after the 10th of each month; in fact, the mean for this class is slightly smaller than for 'before_10_days'. Still, the correlation between the feature and the target shows a relevant relationship, so it can be considered important to the model.
H11: Stores should sell less on weekends
Since there were fewer Sundays in `day_of_week`, we used the mean to compare sales by day.
Conclusion: FALSE.
On average, we can't say that sales are lower on weekends. The correlation, however, is strong enough for the feature to be considered in the model.
H12: Stores should sell less during school holidays
Here we used `school_holiday`.
Conclusion: FALSE.
There's no clear evidence that stores sell less during school holidays; on average, sales are almost the same.
- Hypothesis Validation Summary and Feature Relevance
To facilitate visualization, we present the following validation summary together with the relevance of each feature.
Hypothesis | Conclusion | Feature | Relevance |
---|---|---|---|
H1 | True | `assortment` | Medium |
H2 | False | `competition_distance` | Medium |
H3 | False | `competition_time_month` | Medium |
H4 | False | `promo_time_week` | Low |
H5 | - | - | - |
H6 | False | `promo`, `promo2` | Low |
H7 | False | `state_holiday` | Medium |
H8 | True | `year` | High |
H9 | False | `month` | Medium |
H10 | False | `day` | Medium |
H11 | False | `day_of_week` | Medium |
H12 | False | `school_holiday` | Low |
For numerical attributes, we used Pearson's correlation coefficient and presented it in a heatmap.
Conclusions:
- Correlation with the target: except for `promo`, we can't see a strong correlation between the features and the target. This is not a big problem, because we also have to consider the effect of combined features on the target.
- Multicollinearity (strong relationships between features): in general, features derived from others or time-related features have higher correlation values, like `month` and `week_of_year`.
For categorical attributes we used Cramér's V. Basically, it is a measure of association between two categorical variables that returns a value between 0 and 1; the closer to 1, the stronger the relationship.
To apply it in Python we created a helper function, available in subsection 0.1 of the Jupyter Notebook.
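A minimal sketch of what such a helper can look like (this is an assumption about the notebook's function; a bias-corrected variant is used here):

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Bias-corrected Cramér's V between two categorical variables, in [0, 1]."""
    confusion = pd.crosstab(x, y).values
    chi2 = stats.chi2_contingency(confusion)[0]
    n = confusion.sum()
    r, k = confusion.shape
    phi2 = max(0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2 / min(k_corr - 1, r_corr - 1)))

# example: association between store_type and assortment
print(cramers_v(df['store_type'], df['assortment']))
```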
Conclusions: we highlight the association between `store_type` and `assortment`, which is moderate. Even though it is higher than the others, it is not strong enough to justify dropping one of them from the dataset.
In this phase the data was prepared for modeling. It is divided into two steps:
- Data Preparation: transformations, features scaling, normalizations;
- Feature Selection: using Boruta and the knowledge gained in the EDA section to properly select the features.
The motivation behind data preparation: the learning process of Machine Learning algorithms is easier when the data is numeric and on the same scale.
Since normalization is appropriate for normally distributed variables, and the numerical variable distributions shown in the EDA give no evidence of normality, we decided not to apply it.
Instead, we used the Min-Max scaler for variables without outliers and the RobustScaler for variables that contain them.
- Min-Max Scaler: used for `year`;
- RobustScaler: used for `competition_distance`, `competition_time_month`, and `promo_time_week`.
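A minimal sketch of this rescaling (the column choices follow the list above):

```python
from sklearn.preprocessing import MinMaxScaler, RobustScaler

mms = MinMaxScaler()
df['year'] = mms.fit_transform(df[['year']]).ravel()

for col in ['competition_distance', 'competition_time_month', 'promo_time_week']:
    rs = RobustScaler()
    df[col] = rs.fit_transform(df[[col]]).ravel()
```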
- Encoding: one-hot encoding was used for `state_holiday`, label encoding for `store_type`, and ordinal encoding for `assortment`.
- Response Variable Transformation: since ML algorithms tend to work better when the response is close to normal, we applied a log transformation to the target (`sales`).
- Cyclic transformation (for time-related variables): since `day_of_week`, `month`, `day`, and `week_of_year` have a cyclical nature (for each period they repeat their values, e.g. for each week, `day_of_week` goes from 1 to 7), we created new variables containing the sine and cosine of each of them to represent that cyclical nature (see the sketch after this list). The following columns were created: `day_of_week_sin`, `day_of_week_cos`, `month_sin`, `month_cos`, `day_sin`, `day_cos`, `week_of_year_sin`, and `week_of_year_cos`.
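A minimal sketch of the target and cyclic transformations (the period values are assumptions):

```python
import numpy as np

# log1p keeps zeros safe; np.expm1 reverses it at prediction time
df['sales'] = np.log1p(df['sales'])

# encode each cyclic feature as a point on the unit circle
cycles = {'day_of_week': 7, 'month': 12, 'day': 30, 'week_of_year': 52}
for col, period in cycles.items():
    df[f'{col}_sin'] = np.sin(2 * np.pi * df[col] / period)
    df[f'{col}_cos'] = np.cos(2 * np.pi * df[col] / period)
```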
The focus here is to keep the variables that best explain the target. We followed the Occam's Razor principle: a simpler explanation (or model) of the problem should be preferred over a complex one. A model containing only the important features tends to generalize better (to make better predictions).
To help us decide which features to select, we ran Boruta on the dataset. Boruta is a wrapper feature selection method, that is, a method that uses a Machine Learning algorithm to determine the best features. For more about this kind of feature selection algorithm, we recommend this post.
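A minimal sketch of the Boruta run, assuming `X_train`/`y_train` hold the prepared training data and using the community `boruta` (boruta_py) package:

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_jobs=-1)
boruta = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta.fit(X_train.values, y_train.values.ravel())

selected = X_train.columns[boruta.support_].tolist()
print(selected)
```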
Variables Selected |
---|
store |
promo |
store_type |
assortment |
competition_distance |
competition_open_since_month |
competition_open_since_year |
promo2 |
promo2_since_week |
promo2_since_year |
competition_time_month |
promo_time_week |
day_of_week_sin |
day_of_week_cos |
month_cos |
day_sin |
day_cos |
week_of_year_cos |
Variables not Selected |
---|
is_promo |
month_sin |
school_holiday |
state_holiday_christmas |
state_holiday_easter_holiday |
state_holiday_public_holiday |
state_holiday_regular_holiday |
week_of_year_sin |
year |
Now we had to weigh both Boruta's results and the feature relevance from the EDA section.
Thus, the features manually selected are in the following final list:
Variables Selected |
---|
store |
promo |
store_type |
assortment |
competition_distance |
competition_open_since_month |
competition_open_since_year |
promo2 |
promo2_since_week |
promo2_since_year |
competition_time_month |
promo_time_week |
day_of_week_sin |
day_of_week_cos |
month_sin |
month_cos |
day_sin |
day_cos |
week_of_year_sin |
week_of_year_cos |
Final list explanation:
- `promo` and `promo2` were classified with low relevance in the EDA, but we decided to keep them in the dataset and explore them further in the next CRISP cycle;
- even though Boruta didn't select `month_sin`, we also decided to keep it in the dataset, since the variable `month` has a medium relevance to the target;
- `year` was identified as highly relevant to the target in the EDA step. However, since Boruta rejected it and the year 2015 is incomplete, we decided to exclude it from the dataset;
- we concluded in the EDA that `school_holiday` has low relevance to the target. Since it was also rejected by Boruta, it was excluded from the dataset;
- Boruta rejected the `state_holiday` encodings, although the variable was classified as having medium relevance in the EDA. We decided to exclude the encodings from the dataset and work on them in the next cycle;
- Boruta also rejected `week_of_year_sin`, but we kept it in the model.
This phase is about learning the behavior of the data in order to make generalizations in the future. It comprises two steps: Machine Learning modeling and hyperparameter tuning.
This step aims to choose the best Machine Learning model. First we trained five models (one average model, two linear models, and two tree-based models, explained below) and analyzed their performance on a single split (a one-fold analysis). Then, to compare them more fairly, we built a cross-validation function for time series (available in section 0.1 of the Jupyter Notebook and sketched after the model list below) that accounts for the data variation across many time periods.
The models used were:
1. Average Model: a simple model that serves as a baseline, to check whether the others beat the mean;
2. Linear Regression: a statistical technique that fits the line that minimizes the error in order to predict a continuous dependent variable;
3. Regularized Linear Regression (Lasso): uses shrinkage (data values are shrunk towards a central point, like the mean), penalizing the features' parameters by adding the absolute value of each parameter to the loss;
4. Random Forest Regression: an ensemble model that combines many decision trees to improve prediction;
5. XGBoost Regression: also based on decision trees, but uses gradient boosting.
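A minimal sketch of a time-series cross-validation, assuming `df` holds the prepared, numeric features plus the `date` column and the log-transformed target, and that `model` follows the scikit-learn fit/predict API (the expanding-window scheme with 6-week validation blocks is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

def cross_validation_ts(df, model, target='sales', kfolds=5, window_weeks=6):
    """Expanding-window CV: each fold validates on a later 6-week block."""
    maes = []
    last_date = df['date'].max()
    feature_cols = [c for c in df.columns if c not in (target, 'date')]

    for k in reversed(range(1, kfolds + 1)):
        val_start = last_date - pd.Timedelta(weeks=window_weeks * k)
        val_end = val_start + pd.Timedelta(weeks=window_weeks)

        train = df[df['date'] < val_start]
        valid = df[(df['date'] >= val_start) & (df['date'] <= val_end)]

        model.fit(train[feature_cols], train[target])
        pred = model.predict(valid[feature_cols])

        # undo the log1p transform before computing the business-level error
        maes.append(mean_absolute_error(np.expm1(valid[target]), np.expm1(pred)))

    return np.mean(maes), np.std(maes)
```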
First, we fitted the five models on a single train/validation split. The results are presented below.
Model Name | MAE | MAPE | RMSE | Time to run |
---|---|---|---|---|
Random Forest Regressor | 676.82 | 0.10 | 1005.00 | 44m |
XGBoost Regressor | 856.03 | 0.12 | 1265.33 | 26m7s |
Average Model | 1354.80 | 0.21 | 1835.14 | 341ms |
Linear Regression | 1867.65 | 0.29 | 2671.33 | 2.51s |
Lasso Regression | 1891.46 | 0.29 | 2742.92 | 2.17s |
Conclusions:
- Both Linear and Lasso Regression performed worse than the Average Model: their errors are greater than the baseline's;
- So the data has a complex (non-linear) behavior that linear models may not be able to learn;
- The regularized model (Lasso) performed even worse than the plain Linear Regression;
- The Random Forest Regressor got the smallest errors. However, it took too long to run (even with only 100 estimators).
Then we applied the time-series cross-validation, because this method properly accounts for the data variation across many time periods. The goal is to get the mean error and the standard deviation across all the folds tested. The results are shown below.
Model Name | MAE CV | MAPE CV | RMSE CV | Time to Run |
---|---|---|---|---|
Random Forest Regressor | 797.21 +/- 147.56 | 0.11 +/- 0.02 | 1198.69 +/- 269.98 | 2h25m |
XGBoost Regressor | 1028.61 +/- 120.34 | 0.14 +/- 0.01 | 1473.61 +/- 211.25 | 1h39s |
Linear Regression | 1937.11 +/- 79.38 | 0.29 +/- 0.02 | 2745.97 +/- 154.27 | 13.1s |
Lasso | 1978.51 +/- 97.02 | 0.28 +/- 0.01 | 2849.0 +/- 200.1 | 29.3s |
Model Selection Conclusions: since XGBoost is the second-best model in terms of errors and takes much less time to run, we decided to finish this cycle with it. Another reason is that, after tuning its parameters, the Random Forest Regressor could take even longer to run. Since time is a cost in a business context, we have to consider it when making these decisions. In the second CRISP cycle we can try another model and improve performance.
Here we wanted to find the set of parameters that maximizes the algorithm's learning. We did that by applying the random search method; we chose it because it tries parameter combinations at random, which makes it much faster. The best parameters found were as follows:
```python
param_tuned = {
    'n_estimators': 3000,
    'eta': 0.03,
    'max_depth': 5,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
}
```
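A minimal sketch of the random search itself, reusing the `cross_validation_ts` helper sketched earlier (the grid values, the number of trials, and the `df_train` name are assumptions):

```python
import random
import xgboost as xgb

param_grid = {
    'n_estimators': [1500, 1700, 2500, 3000, 3500],
    'eta': [0.01, 0.03],
    'max_depth': [3, 5, 9],
    'subsample': [0.1, 0.5, 0.7],
    'colsample_bytree': [0.3, 0.7, 0.9],
    'min_child_weight': [3, 8, 15],
}

best_mae, best_params = float('inf'), None
for _ in range(10):  # number of random trials
    params = {k: random.choice(v) for k, v in param_grid.items()}
    model = xgb.XGBRegressor(objective='reg:squarederror', n_jobs=-1, **params)
    mae, _ = cross_validation_ts(df_train, model)
    if mae < best_mae:
        best_mae, best_params = mae, params

print(best_params, best_mae)
```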
With these parameters, the error dropped markedly.
Model Name | MAE CV | MAPE CV | RMSE CV |
---|---|---|---|
XGBoost Regressor | 644.21 | 0.10 | 933.16 |
MAPE improved by 4 percentage points, from 14% to 10%.
Here we evaluated the model results with appropriate metrics. Besides that, we translated those metrics into business terms.
This step is about looking at the error and translating it to a business language.
"What's the impact to the business? the model is usefull or I still have to improve it more?". These are examples of the questions we wanted to answer in this phase.
This step is divided into two:
I. Business Performance
Summing the predicted sales for the next six weeks gives us the business performance for each store. Best and worst scenarios were created by adding the Mean Absolute Error (MAE) to, and subtracting it from, the predictions (a minimal sketch of this computation follows the table below).
These scenarios help the manager make better decisions about the investment in each store, considering both the best and the worst case.
Scenario | Values |
---|---|
worst_scenario | R$ 286,006,481.05 |
predictions | R$ 286,728,640.00 |
best_scenario | R$ 287,450,811.39 |
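A minimal sketch of the per-store performance computation, assuming `df_eval` holds the validation rows with the real `sales`, the model `predictions` (both back on the original scale), and the `store` id (requires scikit-learn ≥ 0.24 for `mean_absolute_percentage_error`):

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

perf = (df_eval.groupby('store')
        .apply(lambda g: pd.Series({
            'predictions': g['predictions'].sum(),
            'MAE': mean_absolute_error(g['sales'], g['predictions']),
            'MAPE': mean_absolute_percentage_error(g['sales'], g['predictions']),
        }))
        .reset_index())

perf['worst_scenario'] = perf['predictions'] - perf['MAE']
perf['best_scenario'] = perf['predictions'] + perf['MAE']

# the eight stores with the highest percentage error
print(perf.sort_values('MAPE', ascending=False).head(8))
```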
Below are the eight stores with the highest Mean Absolute Percentage Error (MAPE).
store | predictions | worst_scenario | best_scenario | MAE |
---|---|---|---|---|
292 | 105383.86 | 102061.26 | 108706.46 | 3322.60 |
909 | 237669.67 | 230022.56 | 245316.78 | 7647.11 |
595 | 344569.97 | 339579.62 | 349560.32 | 4990.35 |
876 | 207206.31 | 203287.59 | 211125.03 | 3918.72 |
722 | 357292.44 | 355184.43 | 359400.45 | 2108.01 |
718 | 201979.52 | 200086.88 | 203872.15 | 1892.64 |
274 | 193574.20 | 192156.15 | 194992.26 | 1418.05 |
782 | 221717.41 | 220967.12 | 222467.70 | 750.29 |
We can see that some stores have a MAPE above 50%, which means their predictions are off by more than 50%. Let's look at a scatter plot of the MAPE values.
The majority of the Mean Absolute Percentage Errors lie between 5% and 20%. Since this is a fictional project, we can't talk to the business team and get their approval of the predictions. So, let's pretend they approved and keep going.
II. Machine Learning Performance
This is the last analysis before the model deployment. Here we analyzed the overall model performance through five charts. Starting with the fit of the model, the chart below shows that the predictions seem to fit the real sales well.
The error rate (the ratio between predicted and observed values) is presented in the following chart. We can see that it varies around 0.15, which can be considered low for this first cycle. We'll try to reduce it in the next CRISP cycle.
It is important to analyze the residuals' behavior when dealing with regression. One of the most important premises of a good model is that the residuals have a normal-shaped distribution with zero mean and constant variance. The following chart shows that the residuals seem close to normal.
The next chart also helps us analyze the residuals. The expected shape is the residuals concentrated within a 'tube'. Since we can't see any clear trend in the residuals, there doesn't seem to be heteroscedasticity.
The last task in this step is to check the fit of the residuals to the normal distribution. As shown below, it's not a perfect fit, but it's good enough to continue this cycle; we can improve it later.
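A minimal sketch of these checks, assuming `df_eval` from the evaluation step above (with `date`, real `sales`, and model `predictions` on the original scale):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df_eval['error'] = df_eval['sales'] - df_eval['predictions']
df_eval['error_rate'] = df_eval['predictions'] / df_eval['sales']

fig, ax = plt.subplots(2, 2, figsize=(14, 8))
sns.lineplot(data=df_eval, x='date', y='sales', ax=ax[0, 0], label='sales')
sns.lineplot(data=df_eval, x='date', y='predictions', ax=ax[0, 0], label='predictions')
sns.lineplot(data=df_eval, x='date', y='error_rate', ax=ax[0, 1])       # error rate over time
sns.histplot(df_eval['error'], ax=ax[1, 0])                             # residual distribution
sns.scatterplot(data=df_eval, x='predictions', y='error', ax=ax[1, 1])  # 'tube' check
plt.tight_layout()
plt.show()
```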
Creating and evaluating the model is usually not the end of the project. The results have to be delivered or presented to the stakeholders, and that's what this phase is about.
We decided to present the results on the stakeholder's smartphone. To do that, we deployed the model in a cloud server and we created a Telegram bot to present the results.
We deployed the model, the scalers, and the transformations on Heroku, a platform "that enables developers to build, run, and operate applications entirely in the cloud" (see website).
After testing the application and the requests locally, the bot was created. The production architecture is as follows:
How it works:
- the user sends a store number to the Telegram bot;
- the Rossmann API (`rossmann-bot.py`) receives the request and retrieves the data for that store from the test dataset;
- the Rossmann API sends the data to the Handler API (`handler.py`);
- the Handler API applies the data preparation to shape the raw data and generates the predictions using the model (`model_rossman.pkl`);
- the Handler returns the predictions to the Rossmann API; and,
- the Rossmann API returns the sales prediction to the user on Telegram.
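A minimal, hypothetical sketch of how a `handler.py` endpoint could look, assuming a `Rossmann` helper class that wraps the saved scalers and encodings (the class, file paths, and route names are assumptions about the project structure):

```python
import pickle
import pandas as pd
from flask import Flask, request, Response

from rossmann.Rossmann import Rossmann  # hypothetical preprocessing helper

model = pickle.load(open('model/model_rossman.pkl', 'rb'))
app = Flask(__name__)

@app.route('/rossmann/predict', methods=['POST'])
def rossmann_predict():
    test_json = request.get_json()
    if not test_json:  # empty request
        return Response('{}', status=200, mimetype='application/json')

    # accept either a single record or a list of records
    if isinstance(test_json, dict):
        df_raw = pd.DataFrame(test_json, index=[0])
    else:
        df_raw = pd.DataFrame(test_json, columns=test_json[0].keys())

    pipeline = Rossmann()
    df1 = pipeline.data_cleaning(df_raw)
    df2 = pipeline.feature_engineering(df1)
    df3 = pipeline.data_preparation(df2)
    return pipeline.get_prediction(model, df_raw, df3)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```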
The following gif shows the bot receiving the requests and sending back the predictions. The bot is configured to return 'Store Not Available' if the specified store is not in the test dataset and 'Wrong ID' if the user types something other than a number.
Our team will revisit the delivery method in the next cycle; maybe we'll use a different one. Meanwhile, there is some additional information that could be added to the bot, such as:
- a welcome message;
- the best and the worst scenarios for the stores;
- the total prediction (also the total best and worst scenarios);
- display a chart;
- requests for more than one store;
- display a 'wait' message while the request is made.
In this project we built an end-to-end sales prediction solution, going from the initial business understanding to the deployment of the product to the stakeholder through a Telegram bot, following CRISP-DM cycles. Besides the business knowledge gained in the Exploratory Data Analysis, we built a model that properly predicts the next six weeks of sales for Rossmann stores using the XGBoost algorithm.
After two months and one day (including this ReadMe creation), I finished the first cycle of this project and I want to highlight two main lessons I learned:
- The construction of an end-to-end Data Science solution is challenging, both in terms of business understanding and Machine Learning techniques;
- We need more than Python and statistics to really create business value with Data Science: we need to know how to solve a business problem, understand its demand, anticipate the challenges we'll face along the way and mitigate them, and develop a solution that the stakeholders can actually use. In summary, we need a business perspective when dealing with these projects.
That's all thanks to Meigarom's course.
"Do… or do not. There is no try." - Master Yoda