ROSSMANN Sales Forecasting: An End-to-End Data Science Project Using Machine Learning for Sales Prediction


This project is a sales prediction for Rossmann, a German drugstore chain with more than 4,000 branches across seven European countries. The dataset is available on Kaggle.

Here we develop an end-to-end solution, starting with the understanding of the business demand, then going through visualizations, data preparation, and modeling, and ending with the deployment of the model using the Heroku cloud service and a Telegram bot that presents the results to the stakeholders.

Note: If there is any problem opening the Jupyter notebook here on GitHub, please check it at this link.

Special Mention

This project comes from Meigarom Lopes's course Data Science em Produção. The course goes from business understanding to product deployment. It basically addresses the steps presented below in this ReadMe with a strong focus on the business context and on problem solving for business value creation.

I'm very grateful to Alan Maehara, a teammate from the Data Science em Produção course, for his Sales Prediction project, which helped me a lot in building this presentation. Many elements of this ReadMe, like tables, structure, and charts, were inspired by his project. If you'd like to see a very well structured, detailed project, you should take a look at his solution. Besides that, it's a very good learning resource (he explains many statistical and data science concepts and techniques) and I'm sure you'll enjoy his project-building journey.


Contents


A Brief Introduction to the ROSSMANN Company

This is based on the ROSSMANN company portrait available on the company website

Dirk Rossmann GmbH is one of the largest drugstore chains in Europe and the largest in Germany as of 2020 (see list). The retail company, founded by Dirk Roßmann in 1972 in Germany, operates over 4,000 drugstores in 7 European countries. The company has been increasing the number of stores outside Germany over the last few years, as shown by this chart.

With more than 4,000 branches (2,196 of them in Germany), the company's operations extend to Albania, the Czech Republic, Hungary, Poland, and Turkey. The foreign subsidiaries contributed 30 percent of group sales.

The group started 2020 intending to open 200 new branches and to maintain its rate of expansion (an investment volume of 200 million euros).

ROSSMANN carries around 21,700 different items, with a focus on skin and body care, food and luxury foods, baby products, detergents, cleaning products, and hair care.

Back to Contents


Project Methodology: CRISP-DM

The method used to manage the project was the Cross-Industry Standard Process for Data Mining (CRISP-DM), one of the most widely used methodologies for Data Science projects.

As said by Wirth and Hipp, "the CRISP-DM reference model for data mining provides an overview of the life cycle of a data mining project". This process allows us to iterate over the steps and we can map all possible problems in the project.

Aiming to provide more productivity and effectiveness, the Data Science project is broken into six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

CRISP-DM Cycle

(image: CRISP-DM cycle diagram)

CRISP-DM was chosen mainly because of four advantages:

  • for each complete cycle, we have an end-to-end solution that can be implemented and meet some of the business demands;
  • it delivers business results more quickly than many other methodologies;
  • we can map many problems, anticipate impediments, and avoid them;
  • it helps us not to spend too much time on any particular step.

In this project, the CRISP-DM phases are broken into eleven steps (numbered 0 to 10): Business Understanding, Data Description, Feature Engineering, Filtering Variables, Exploratory Data Analysis, Data Preparation, Feature Selection, Machine Learning Modeling, Hyperparameter Fine Tuning, Translating and Interpreting the Error, and Model Deployment.

Merging the phases and the steps we have:

Phase 1: Business Understanding

Correctly identify and understand the business demand, its objectives, and the requirements from a business perspective. This understanding has to go deeper, identifying who the true stakeholder is and the reason for the request.

  • Step 0: Business Demand

Phase 2: Data Understanding

It starts with data collection, followed by activities to get familiar with the data, identify data quality problems, discover some initial insights, and/or detect interesting subsets to form hypotheses about hidden information.

  • Step 1: Data Collection and Description
  • Step 2: Hypothesis Creation and Feature Engineering
  • Step 3: Filtering Variables and Rows
  • Step 4: Exploratory Data Analysis

Phase 3: Data Preparation

This phase aims to prepare the data for modeling.

  • Step 5: Data Preparation (transforming, scaling, ...)
  • Step 6: Feature Selection

Phase 4: Modeling

Build Machine Learning models and compare them using cross-validation. After that, choose the best one and tune its parameters.

  • Step 7: Machine Learning Modeling
  • Step 8: Hyperparameter Fine Tuning

Phase 5: Evaluation

Evaluate the model results with appropriate metrics and translate those metrics into business terms.

  • Step 9: Translating and Interpreting the Error

Phase 6: Deployment

Creating and evaluating the model is usually not the end of the project. The results have to be delivered or presented to the stakeholders, and that's what this phase is about.

  • Step 10: Deploying the Machine Learning Model to Production: a telegram bot

Next Cycle

After completing all these phases, we have a solution (even though it may not be the best one) that satisfies the initial business demand. We then evaluate the project outcome and the business needs and decide whether more cycles are required. If so, all the steps are done again in order to improve the model.

back to contents


Phase 1: Business Understanding

Before starting the data analysis and modeling, the first task is to understand the business problem we received as data scientists. That's important because there's a big chance that the request we received is not exactly what the stakeholder wants (and sometimes the person who made the request is not the true stakeholder; in that case, we also have to identify who the stakeholder really is).

For that goal, we need to seek to understand four things:

  1. The context behind the business request: how did it come about?
  2. The reason the person is making the request;
  3. Who the main stakeholder of the request is. If he/she is someone other than the managers, maybe the request is not exactly what we received; besides that, the main stakeholder can guide us through the project;
  4. The solution format:
    • granularity: daily, weekly, by stores, by product, and so forth;
    • What kind of machine learning problem it is: classification, regression, clustering, and so on;
    • Main methods that could be used: time series, SVM, artificial neural networks, and so on;
    • presentation method: dashboard, mail, smartphone message, and so on.

The Business Context

⚠️⚠️ Disclaimer: since we don't work for ROSSMANN and we only have the dataset, it'll be helpful to create a hypothetical business context to guide the project. The following context is basically a mix of the Rossmann description and task on Kaggle with some additions from the Data Science em Produção course. So, let's pretend we work as data scientists for ROSSMANN.

Our Data Science team received a request from the store managers to forecast the sales of their respective stores for the next six weeks. Before starting to handle the data, our team decided to better understand that business request. We found that the request actually came from the CFO, who asked the managers for predictions of each store's revenue for the next six weeks because he wanted to renovate the stores and planned to base the investment in each one on its sales over that period.

So, our team set out to understand the four things mentioned before:

  • The business request context: in a monthly meeting, the CFO asked the managers for the next six weeks' sales predictions for each store;
  • The reason behind the request: the CFO wants to anticipate a portion of the stores' revenue to invest in their renovation;
  • Stakeholder: CFO
  • The Solution format:
    • Granularity: daily sales by stores;
    • Kind of problem: Sales Forecast;
    • Main methods: Regression, Time Series;
    • Delivery method: real-time six-week sales forecast in a smartphone app.

back to contents


Phase 2: Data Understanding

After better understanding the business demand, our team could start handling the data. As said before, this phase involves loading, cleaning, applying descriptive statistics, and exploring the data. This phase comprises steps one through four.

Step 1: Data Collection and Description

In this step we get a first sense of how challenging the problem we are dealing with is.

Data Collection

In a real-life project, this step starts with requests to databases, APIs, and so on in order to collect all the available information that could help us solve the problem. With that in hand, our team can create a final dataset. Since this project works with data from Kaggle, our first task is simply to download the CSV files and load them into a Jupyter notebook.

The data is from Rossmann Store Sales on Kaggle. There are three main datasets:

  • Training Data: historical data for training the model (it includes the target: sales)
  • Test Data: historical data for testing the model (it does not include the target)
  • Store Data: supplemental information about the stores

Training Data Dimensions:

- Number of Rows: 1,017,209
- Number of Columns: 18
- Date Range: from 2013-01-01 to 2015-07-31
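As a rough sketch of this step, assuming the Kaggle files were downloaded into a local data/ folder (paths and variable names are illustrative):

```python
# Load the raw Kaggle files and merge the store metadata into the sales history.
# Column names here are the original Kaggle ones; the notebook later renames
# them to snake_case.
import pandas as pd

df_sales_raw = pd.read_csv('data/train.csv', low_memory=False)
df_store_raw = pd.read_csv('data/store.csv', low_memory=False)

df_raw = pd.merge(df_sales_raw, df_store_raw, how='left', on='Store')
print(df_raw.shape)  # roughly (1017209, 18), as described above
```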

Variables Description and types:

edited table from this project

| Variable | Description | Data Type |
|---|---|---|
| sales (target) | the turnover for any given day (this is what we are predicting) | numerical (continuous) |
| store | store ID (unique) | numerical (discrete) |
| day_of_week | day of the week (1 = Monday, 7 = Sunday) | numerical (discrete) |
| date | date of each sales entry | date |
| customers | the number of customers on a given day | numerical (discrete) |
| open | an indicator for whether the store was open: 0 = closed, 1 = open | numerical (dummy) |
| promo | indicates whether a store is running a promo on that day | numerical (dummy) |
| state_holiday | indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = none | categorical (nominal) |
| school_holiday | indicates if the (store, date) was affected by the closure of public schools: 1 = affected, 0 = not affected | numerical (dummy) |
| store_type | differentiates between 4 different store models: a, b, c, d | categorical (nominal) |
| assortment | describes an assortment level: a = basic, b = extra, c = extended | categorical (ordinal) |
| competition_distance | distance in meters to the nearest competitor store | numerical (continuous) |
| competition_open_since_month | gives the approximate month the nearest competitor was opened | numerical (discrete) |
| competition_open_since_year | gives the approximate year the nearest competitor was opened | numerical (discrete) |
| promo2 | a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating | numerical (dummy) |
| promo2_since_week | the calendar week when the store started participating in Promo2 | numerical (discrete) |
| promo2_since_year | the year when the store started participating in Promo2 | numerical (discrete) |
| promo_interval | describes the consecutive intervals in which Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year for that store | categorical (nominal) |

Data Cleaning: Imputation and Changing Types

During this step, our team identified some missing values and some inappropriate data types. It's important to deal with them because many Machine Learning algorithms can't handle such problems.
First, we changed the 'date' column type from object to pandas datetime using the .astype() function.

Second, we dealt with the missing data. We identified the following columns with missing values:

| Variable | NANs |
|---|---|
| competition_distance | 2642 |
| competition_open_since_month | 323348 |
| competition_open_since_year | 323348 |
| promo2_since_week | 508031 |
| promo2_since_year | 508031 |
| promo_interval | 508031 |

The first task after identifying those NANs would be to try to discover why there are missing values in the dataset. In a real-life context, we would have talked to the data engineering team to better identify the reason for the missing data; it could also be a mistake made when collecting it. Since in this fictional project there's no one who could give us that information, we cannot know exactly the reason for those NANs. Thus, the next step is to handle them.

There are some techniques for dealing with missing values, such as imputing the column mean, dropping the rows or the columns, or using Machine Learning algorithms to predict the value. However, we tried to impute values based on business understanding, using other columns to derive a value. Since we are working in CRISP-DM cycles, we can handle this better in the next one. Due to the high number of missing values, the traditional methods would not work well here: dropping rows or imputing the mean could lose too much information.

The imputation was as follows (a code sketch appears after the list):

  • competition_distance: maybe null values indicate that there is no competitor nearby. So, we imputed a value greater than the maximum distance in the column: since the maximum is 75,860 m, we imputed 200,000 m for the missing data;
  • competition_open_since_month and competition_open_since_year: for both, the value was imputed from the date column (its month and year, respectively);
  • promo2_since_week and promo2_since_year: if there are missing values in these columns, it's probably because the store is not participating in promo2. So, we filled them with the sale date;
  • promo_interval: we created the column is_promo (a dummy variable) to indicate whether a store is holding a consecutive promotion on that day; promo_interval itself will be dropped in Step 3.
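A minimal sketch of these rules, assuming the merged dataframe df1 already uses snake_case column names and a datetime date column:

```python
import math

# competition_distance: NA likely means no competitor nearby, so fill with a
# distance far beyond the observed maximum (75,860 m)
df1['competition_distance'] = df1['competition_distance'].apply(
    lambda x: 200000.0 if math.isnan(x) else x)

# competition_open_since_month/year: fall back to the sale date's month/year
df1['competition_open_since_month'] = df1.apply(
    lambda x: x['date'].month if math.isnan(x['competition_open_since_month'])
    else x['competition_open_since_month'], axis=1)
df1['competition_open_since_year'] = df1.apply(
    lambda x: x['date'].year if math.isnan(x['competition_open_since_year'])
    else x['competition_open_since_year'], axis=1)

# promo2_since_week/year: NA means the store never joined promo2; use the sale date
df1['promo2_since_week'] = df1.apply(
    lambda x: x['date'].isocalendar()[1] if math.isnan(x['promo2_since_week'])
    else x['promo2_since_week'], axis=1)
df1['promo2_since_year'] = df1.apply(
    lambda x: x['date'].year if math.isnan(x['promo2_since_year'])
    else x['promo2_since_year'], axis=1)

# promo_interval -> is_promo: 1 if the sale month is inside the promo2 interval
month_map = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
             7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
df1['promo_interval'] = df1['promo_interval'].fillna(0)
df1['month_map'] = df1['date'].dt.month.map(month_map)
df1['is_promo'] = df1[['promo_interval', 'month_map']].apply(
    lambda x: 0 if x['promo_interval'] == 0
    else 1 if x['month_map'] in x['promo_interval'].split(',') else 0, axis=1)
```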

Descriptive Statistics

To summarize the data, we used descriptive statistics. We divided the dataset into numerical and categorical variables and applied the appropriate techniques to each.

For numerical variables, we basically used central tendency (mean and median) and dispersion (standard deviation, minimum, maximum, range, first and third quartiles, skewness, and kurtosis) measures.
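A rough sketch of this summary, assuming num_attributes holds only the numeric columns of the cleaned dataframe:

```python
import pandas as pd

num_attributes = df1.select_dtypes(include=['int64', 'float64'])

# one row per variable, one column per statistic
summary = pd.concat([
    num_attributes.mean().rename('mean'),
    num_attributes.median().rename('median'),
    num_attributes.std().rename('std'),
    num_attributes.min().rename('min'),
    num_attributes.max().rename('max'),
    (num_attributes.max() - num_attributes.min()).rename('range'),
    num_attributes.skew().rename('skew'),
    num_attributes.kurtosis().rename('kurtosis'),
], axis=1)
print(summary)
```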

Two columns are highly skewed: competition_distance and competition_open_since_year. Besides that, the kurtosis of competition_distance is very large, which can indicate the presence of outliers (remember that we added large values when filling the NANs).

Since there are not many techniques for analyzing categorical data, we used boxplots of sales for each categorical variable.

Sales seem to differ between store types and assortment levels.

Step 2: Hypothesis Creation and Feature Engineering

To better understand the relationship between the target variable (sales) and the features, we created some hypotheses based on the business problem to guide the feature engineering and then the exploratory data analysis.

The hypotheses created in this step will be validated in the Exploratory Data Analysis step. This gives us a good notion of the relationship between the features and the target and helps us choose more accurately which features to use when modeling.

Hypothesis Mind Map

To guide the hypothesis creation, the following mind map was created. The Hypothesis Mind Map has basically three elements:

  • The phenomenon: it's what we want to measure or model (sales);
  • Agents: entities that somehow impact the phenomenon (customers, stores);
  • Agents' attributes: age, marital status, size, and so on.

(image: hypothesis mind map)

Mind Map goal: derive a list of hypotheses, from which we can prioritize some.

Hypothesis Creation

After creating the hypothesis list, we prioritized some over others. The criterion used was the availability of the corresponding feature.

Hypothesis Final List

  1. Stores with a larger assortment should sell more.

  2. Stores with closer competitors should sell less.

  3. Stores with longer-established competitors should sell more.

  4. Stores with active promotions for longer should sell more.

  5. Stores with more promotion days should sell more.

  6. Stores with more consecutive promotions should sell more.

  7. Stores open during the Christmas holiday should sell more.

  8. Stores should be selling more over the years.

  9. Stores should sell more in the second half of the year.

  10. Stores should sell more after the 10th of each month.

  11. Stores should sell less on weekends.

  12. Stores should sell less during school holidays.

Feature Engineering

First of all, why do we need this step, and why do it before the EDA?

  • Since we created a list of hypotheses, some variables may not be available in the dataset. This step makes sure we'll have them for the exploratory data analysis;
  • To avoid cluttering the EDA section with feature creation, maps, and tables. That is, to keep the code clean, with feature creation and exploratory analysis in separate sections.

We created the following features (a code sketch follows the list):

  • from the date column we created year, month, day, week_of_year, and year_week, since we need them to validate some hypotheses;
  • competition_since: how long, in months, the nearest competitor has existed relative to the sale date;
  • promo_since: how long a promotion has been active.
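A minimal sketch of these derived features, assuming df2 already has a datetime date column and the imputed competition/promo2 columns from the previous step:

```python
from datetime import datetime, timedelta
import pandas as pd

# calendar features needed to validate the hypotheses
df2['year'] = df2['date'].dt.year
df2['month'] = df2['date'].dt.month
df2['day'] = df2['date'].dt.day
df2['week_of_year'] = df2['date'].dt.isocalendar().week.astype(int)
df2['year_week'] = df2['date'].dt.strftime('%Y-%W')

# competition_since -> competition_time_month: months since the competitor opened
df2['competition_since'] = pd.to_datetime(df2.apply(
    lambda x: datetime(int(x['competition_open_since_year']),
                       int(x['competition_open_since_month']), 1), axis=1))
df2['competition_time_month'] = ((df2['date'] - df2['competition_since'])
                                 / pd.Timedelta(days=30)).astype(int)

# promo_since -> promo_time_week: weeks since the store joined promo2
promo_since = (df2['promo2_since_year'].astype(int).astype(str) + '-'
               + df2['promo2_since_week'].astype(int).astype(str))
df2['promo_since'] = pd.to_datetime(promo_since.apply(
    lambda x: datetime.strptime(x + '-1', '%Y-%W-%w') - timedelta(days=7)))
df2['promo_time_week'] = ((df2['date'] - df2['promo_since'])
                          / pd.Timedelta(weeks=1)).astype(int)
```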

Some categorical features had their classes renamed: state_holiday and assortment were just letters and became descriptions (e.g., 'a' in state_holiday became 'public_holiday').

Since store_type had no description in the data source, it wasn't changed.

Step 3: Filtering Variables and Rows

Based on business restrictions (e.g., we won't have a given variable available in the future to use in the prediction), we have to properly filter the variables.

This step is divided into two parts: filtering rows and filtering columns (a code sketch follows below).

Filtering Rows: since closed stores obviously have no sales on those days, we dropped the rows where open equals zero. We also kept only the rows where sales are greater than zero.

Filtering Columns:

  • customers: we can't use customers because we won't have this data available for the next six weeks (the prediction period) unless we build another project to predict how many customers the stores will have in that period;
  • promo_interval and month_map were used to create new columns. They'll not be used anymore.
  • since open has no use anymore, we dropped it too.
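A possible implementation of both filters, assuming the engineered dataframe is df2:

```python
# Keep only open stores with positive sales, then drop columns that either
# leak future information (customers) or have already served their purpose.
df3 = df2[(df2['open'] != 0) & (df2['sales'] > 0)]

cols_drop = ['customers', 'open', 'promo_interval', 'month_map']
df3 = df3.drop(cols_drop, axis=1)
```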

Step 4: Exploratory Data Analysis (EDA)

In this step we go deeper into the data to get some valuable business insights. Basically, we want to know which variables are the most important for the prediction and how strong that importance is.

The EDA is divided into three parts:

  1. Univariate Analysis: to get an overview of the individual variables, looking at their distributions and counting classes for categorical features;
  2. Bivariate Analysis: to check the relationship between the features and the target based on the hypothesis list. This is one of the most important steps, in which we can get business insights and decide whether a feature is really important to the model;
  3. Multivariate Analysis: to check the relationships between the features and identify those that are highly correlated.

Univariate Analysis

It's divided into Response, Numerical, and Categorical analysis.

Response Variable (sales):

The target distribution does not appear to be normal. Since many Machine Learning algorithms perform better when the target is (close to) normally distributed, we checked this using both a Q-Q plot and the Shapiro-Wilk normality test. As shown by the Q-Q plot below, the distribution does not seem to follow a normal distribution. The Shapiro-Wilk test (statistic = 0.902; p-value = 0.000, less than 0.05) confirms that it is probably non-Gaussian. Therefore, we'll have to transform it before modeling.
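A sketch of this check, assuming df4 holds the filtered data (the notebook's exact approach may differ; scipy's Shapiro-Wilk implementation is only reliable up to roughly 5,000 observations, so a sample is used here):

```python
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

sales = df4['sales']

# Shapiro-Wilk test on a random sample of the target
stat, p_value = stats.shapiro(sales.sample(5000, random_state=42))
print(f'statistic = {stat:.3f}, p-value = {p_value:.3f}')

# Q-Q plot of the target against the theoretical normal quantiles
sm.qqplot(sales, line='s')
plt.show()
```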

Numerical Variables:

None of the variables seem to be normally distributed. Some highlights:

  • day_of_week: sales are lower on Sundays;
  • competition_distance: there are more stores with competitors nearby than far away;
  • promo2_since_year: there are more stores that joined consecutive promotion sales in 2013.

Categorical Variables:

Some conclusions:

  • state_holiday: There's a difference for the three types of holidays. Based on that difference, the model could better adjust its predictions;
  • store_type: sales behave differently for different store types. For instance, stores of type 'b' have less concentrated sales than the others;
  • assortment: there are fewer sales for stores with 'extra' assortment.

Bivariate Analysis

In this task, the hypotheses were validated one by one. As said before, what we basically did was:

  • validate the hypothesis;
  • conclude if the feature is important to use in the model;
  • get some business experience.

H1: Stores with a larger assortment should sell more
This hypothesis is about the assortment variable. We don't have much information about this feature, only that it has three classes: basic, extended, and extra. The count for each one is as follows:

  • basic: 444875
  • extended: 391254
  • extra: 8209

Since the number of observations differs across the three classes, we used the average (not the sum) for comparison.

Conclusion: TRUE.
On average, sales seem to increase as the assortment gets bigger. Also, analyzing over the weeks, we can see that 'extended' and 'basic' move quite similarly.

H2: Stores with closer competitors should sell less
This hypothesis is about the variable competition_distance. It is the distance in meters to the nearest competitor store.

Conclusion: FALSE.
Stores with closer competitors sell more. Competition distance and sales have a negative, non-linear correlation, which means that as the distance increases, sales decrease. That correlation (-0.23) is good enough to consider the variable important to the model.

H3: Stores with longer-established competitors should sell more
Here the competition_time_month variable is analyzed.

Conclusion: FALSE.
The more recent the competition, the higher the sales. The feature is relevant to the model because its correlation with the target is not too close to zero.

H4: Stores with active promotions for longer should sell more
To validate this hypothesis, the promo_time_week column was used. It measures how long, in weeks, a promotion has been active.

Conclusion: FALSE.
Stores with active promotions for longer sell less, because sales start to decrease after a period of time. According to the correlation, there's no evidence of a strong relationship between this feature and the target.

H5: Stores with more promotion days should sell more
Our team decided to validate this hypothesis in the second CRISP cycle.

H6: Stores with more consecutive promotions should sell more
This hypothesis analyses promo and promo2 in terms of weeks of the year (year_week).

Conclusion: FALSE.
Stores with more consecutive promotions sell less. Since both levels move quite similarly, there's no evidence of a strong relationship between this feature and the target.

H7: Stores open during the Christmas holiday should sell more
Here the state_holiday variable is analyzed.

Conclusion: FALSE.
On average, stores open during Christmas have one of the highest sales amounts, but the Easter holiday has a higher mean. In fact, stores sell more during holidays than during regular days. So, this feature can be considered important to the analysis.

H8: Stores should be selling more over the years
Here the year column is analyzed.

Conclusion: TRUE.
On average, sales are increasing over the years. Since the correlation is very high, this feature is important to the model.

H9: Stores should sell more in the second half of the year
Here we used month. Since 2015 is incomplete, we compared means instead of sums, because the missing data for the second half of 2015 could mislead us.

Conclusion: FALSE.
Stores sell less in the second half of the year. The feature and the target have a moderate negative correlation and it can be considered important to the model.

H10: Stores should sell more after the 10th of each month
Here the day feature was used.

Conclusion: FALSE.
On average, there's no strong evidence that stores sell more after the 10th day of each month; in fact, the mean for this class is slightly smaller than for 'before_10_days'. However, the correlation between the feature and the target shows a relevant relationship, so it can be considered important to the model.

H11: Stores should sell less on weekends
Since there were fewer Sundays in day_of_week, we used the mean to compare sales by day.

Conclusion: FALSE.
On average, we can't say that sales are lower on weekends. Still, the correlation is strong enough for the feature to be considered in the model.

H12: Stores should sell less during school holidays
Here we used school_holiday.

Conclusion: FALSE.
There's no evidence that stores sell less during school holidays. On average, sales are almost the same.

Hypothesis Validation Summary and Feature Relevance

To facilitate visualization, we present the following validation summary along with each feature's relevance.

| Hypothesis | Conclusion | Feature | Relevance |
|---|---|---|---|
| H1 | True | assortment | Medium |
| H2 | False | competition_distance | Medium |
| H3 | False | competition_time_month | Medium |
| H4 | False | promo_time_week | Low |
| H5 | - | - | - |
| H6 | False | promo, promo2 | Low |
| H7 | False | state_holiday | Medium |
| H8 | True | year | High |
| H9 | False | month | Medium |
| H10 | False | day | Medium |
| H11 | False | day_of_week | Medium |
| H12 | False | school_holiday | Low |

Multivariate Analysis

Numerical Variables

For numerical attributes we used Pearson's correlation coefficient and presented it in a heatmap.

Conclusions:

  • Correlation with the target:
    • except for promo, we can't see a strong correlation between the features and the target. This is not a big problem, because we also have to consider the relationship of combined features with the target.
  • Multicollinearity (strong relationship between features):
    • in general, features derived from others or time-related features have higher correlation values, like month and week_of_year.

Categorical Variables

For categorical attributes we used Cramér's V. Basically, it is a measure of association between two categorical variables that returns a value between 0 and 1: the closer to 1, the stronger the relationship.

To apply it in Python we had to create a function, available in subsection 0.1 of the Jupyter Notebook (a sketch is shown below).
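A minimal sketch of such a function, assuming scipy is available (the bias-corrected version of the statistic is shown; the notebook's implementation may differ in details):

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramer_v(x: pd.Series, y: pd.Series) -> float:
    """Bias-corrected Cramér's V between two categorical series."""
    cm = pd.crosstab(x, y).values          # contingency matrix
    n = cm.sum()
    r, k = cm.shape

    chi2, _, _, _ = stats.chi2_contingency(cm)
    chi2_corr = max(0, chi2 - (k - 1) * (r - 1) / (n - 1))
    k_corr = k - (k - 1) ** 2 / (n - 1)
    r_corr = r - (r - 1) ** 2 / (n - 1)

    return np.sqrt((chi2_corr / n) / min(k_corr - 1, r_corr - 1))

# example: association between store type and assortment
# cramer_v(df4['store_type'], df4['assortment'])
```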

Conclusions: we highlight the relationship between store_type and assortment, which is moderate. Even though it is higher than the others, it is not strong enough to justify dropping one of them from the dataset.

back to contents


Phase 3: Data Preparation

In this phase the data was prepared for modeling. It is divided into two steps:

  • Data Preparation: transformations, feature scaling, normalization;
  • Feature Selection: the use of Boruta and the knowledge gained in the EDA section to properly select features.

Step 5: Data Preparation

The motivation behind data preparation: the learning process of most Machine Learning algorithms is easier when the data is numeric and on the same scale.

Normalization

Since normalization is appropriate for normally distributed variables, and the numerical variable distributions shown in the EDA give no evidence of normality, we decided not to apply it.

Rescaling

Here we used the Min-Max Scaler for variables without outliers and the RobustScaler for variables that contain them (a sketch follows the list).

  • Min-Max Scaler: it was used for year
  • RobustScaler: it was used for competition_distance, competition_time_month, and promo_time_week.
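A rough sketch of this step, assuming scikit-learn and the prepared dataframe df5:

```python
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Min-Max scaling for variables without notable outliers
df5[['year']] = MinMaxScaler().fit_transform(df5[['year']])

# RobustScaler (interquartile-range based) for variables with outliers
outlier_cols = ['competition_distance', 'competition_time_month', 'promo_time_week']
df5[outlier_cols] = RobustScaler().fit_transform(df5[outlier_cols])
```

In production, each fitted scaler is saved (e.g., with pickle) so the same transformation can be applied to new data at prediction time.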

Transformation

  • Encoding: one-hot encoding was used for state_holiday, label encoding for store_type, and ordinal encoding for assortment.

  • Response variable transformation: since many ML algorithms perform better when the response is normal (or close to it), we applied a log transformation to the target (sales).

  • Cyclic transformation (for time-related variables): since day_of_week, month, day, and week_of_year have a cyclical nature (for each period they repeat their values, e.g., for each week, day_of_week goes from 1 to 7), we created new variables containing the sine and cosine of each of them to represent that cyclical nature. The following columns were created: day_of_week_sin, day_of_week_cos, month_sin, month_cos, day_sin, day_cos, week_of_year_sin, and week_of_year_cos (see the sketch below).
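A possible implementation of these encodings and transformations, again assuming df5 and scikit-learn (the ordinal mapping values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# one-hot encoding for state_holiday
df5 = pd.get_dummies(df5, prefix=['state_holiday'], columns=['state_holiday'])

# label encoding for store_type
df5['store_type'] = LabelEncoder().fit_transform(df5['store_type'])

# ordinal encoding for assortment (basic < extra < extended)
df5['assortment'] = df5['assortment'].map({'basic': 1, 'extra': 2, 'extended': 3})

# log transformation of the target
df5['sales'] = np.log1p(df5['sales'])

# cyclic (sine/cosine) transformation for time-related features
for col, period in [('day_of_week', 7), ('month', 12), ('day', 30), ('week_of_year', 52)]:
    df5[f'{col}_sin'] = np.sin(df5[col] * (2.0 * np.pi / period))
    df5[f'{col}_cos'] = np.cos(df5[col] * (2.0 * np.pi / period))
```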

Step 6: Feature Selection

The focus here is to keep the variables that best explain the target. We followed Occam's Razor: a simpler explanation (or model) of the problem should be preferred over a complex one. A model containing only the important features can generalize better (i.e., make better predictions).

To help us decide which features to select, we ran Boruta on the dataset. Boruta is a wrapper method of feature selection, that is, a method that uses a Machine Learning algorithm to determine the best features. For more about feature selection algorithms, we recommend this post.
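A rough sketch of this step, assuming the boruta package (BorutaPy) and that X_train / y_train hold the prepared training split:

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# Boruta wraps a random forest and compares real features against shuffled
# "shadow" copies to decide which ones are relevant.
rf = RandomForestRegressor(n_jobs=-1)
boruta = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=42)
boruta.fit(X_train.values, y_train.values.ravel())

cols_selected = X_train.columns[boruta.support_].tolist()
print(cols_selected)
```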

Variables selected by Boruta:
store
promo
store_type
assortment
competition_distance
competition_open_since_month
competition_open_since_year
promo2
promo2_since_week
promo2_since_year
competition_time_month
promo_time_week
day_of_week_sin
day_of_week_cos
month_cos
day_sin
day_cos
week_of_year_cos
Variables not selected by Boruta:
is_promo
month_sin
school_holiday
state_holiday_christmas
state_holiday_easter_holiday
state_holiday_holiday_holiday
state_holiday_regular_holiday
week_of_year_sin
year

Now we had to analyze both Boruta's results and the feature relevance found in the EDA section.

Thus, the features manually selected are in the following final list:

Variables Selected
store
promo
store_type
assortment
competition_distance
competition_open_since_month
competition_open_since_year
promo2
promo2_since_week
promo2_since_year
competition_time_month
promo_time_week
day_of_week_sin
day_of_week_cos
month_sin
month_cos
day_sin
day_cos
week_of_year_sin
week_of_year_cos

Final list explanation:

  • promo and promo2 were classified as low relevance in the EDA, but we decided to keep them in the dataset and explore them further in the next CRISP cycle;
  • even though Boruta didn't select month_sin, we decided to keep it in the dataset, since the variable month has medium relevance to the target;
  • year was identified as highly relevant to the target in the EDA step. However, since Boruta rejected it and 2015 is incomplete, we decided to exclude it from the dataset;
  • we concluded in the EDA that school_holiday has low relevance to the target. Since it was also rejected by Boruta, it was excluded from the dataset;
  • Boruta rejected state_holiday's encodings, while the EDA classified the variable as having medium relevance. We decided to exclude its encodings from the dataset and work on them in the next cycle;
  • Boruta also rejected week_of_year_sin, but we kept it in the model.

back to contents


Phase 4: Modeling

This phase is about learning the data's behavior in order to be able to make generalizations in the future. It comprises two steps: ML modeling and parameter tuning.

Step 7: Machine Learning Modeling

This step aims to choose the best Machine Learning model. First we trained five models (one average model, two linear models, and two tree-based models, as explained below) and analyzed their single-split performance (a one-fold analysis). Then, to compare them more fairly, we created a time series cross-validation function (available in section 0.1 of the Jupyter Notebook and sketched below) that accounts for the data variation across many time periods.
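A minimal sketch of such an expanding-window cross-validation, assuming x_training still contains the date column and the log-transformed sales target; function and variable names are illustrative:

```python
import numpy as np
import pandas as pd

def time_series_cv(x_training: pd.DataFrame, kfold: int, model) -> dict:
    mae, mape, rmse = [], [], []

    for k in reversed(range(1, kfold + 1)):
        # validate on a six-week window, train on everything before it
        val_start = x_training['date'].max() - pd.Timedelta(weeks=6 * k)
        val_end = val_start + pd.Timedelta(weeks=6)

        train = x_training[x_training['date'] < val_start]
        valid = x_training[(x_training['date'] >= val_start) & (x_training['date'] <= val_end)]

        m = model.fit(train.drop(['date', 'sales'], axis=1), train['sales'])
        yhat = m.predict(valid.drop(['date', 'sales'], axis=1))

        # back-transform the log target before measuring the error
        y_true, y_pred = np.expm1(valid['sales'].values), np.expm1(yhat)
        mae.append(np.mean(np.abs(y_true - y_pred)))
        mape.append(np.mean(np.abs((y_true - y_pred) / y_true)))
        rmse.append(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    return {'MAE CV': f'{np.mean(mae):.2f} +/- {np.std(mae):.2f}',
            'MAPE CV': f'{np.mean(mape):.2f} +/- {np.std(mape):.2f}',
            'RMSE CV': f'{np.mean(rmse):.2f} +/- {np.std(rmse):.2f}'}
```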

The models used were:

  1. Average Model: a simple model that serves as a baseline, to check whether the others beat the mean;
  2. Linear Regression: a statistical technique that fits the line that minimizes the error in order to predict a dependent continuous variable;
  3. Regularized Linear Regression (Lasso): uses shrinkage (data values are shrunk towards a central point, like the mean), penalizing the features' parameters by adding the absolute value of each parameter to the loss;
  4. Random Forest Regressor: an ensemble model that combines many decision trees to improve prediction;
  5. XGBoost Regressor: also based on decision trees, but uses gradient boosting.

First we fitted the five models on a single split. The results are presented below.

| Model Name | MAE | MAPE | RMSE | Time to Run |
|---|---|---|---|---|
| Random Forest Regressor | 676.82 | 0.10 | 1005.00 | 44m |
| XGBoost Regressor | 856.03 | 0.12 | 1265.33 | 26m7s |
| Average Model | 1354.80 | 0.21 | 1835.14 | 341ms |
| Linear Regression | 1867.65 | 0.29 | 2671.33 | 2.51s |
| Lasso Regression | 1891.46 | 0.29 | 2742.92 | 2.17s |

Conclusions:

  1. Both Linear and Lasso Regression performed worse than the Average Model: their errors are greater than the baseline's;
  2. This suggests the data has complex (non-linear) behavior that linear models can't capture;
  3. The regularized linear regression performed even worse than the plain Linear Regression model;
  4. The Random Forest Regressor got the smallest errors. However, it took too long to run (even with only 100 estimators).

Then we applied time series cross-validation, since this method appropriately accounts for the data variation across many time periods. The goal is to get the mean error and the standard deviation over all folds tested. The results are shown below.

| Model Name | MAE CV | MAPE CV | RMSE CV | Time to Run |
|---|---|---|---|---|
| Random Forest Regressor | 797.21 +/- 147.56 | 0.11 +/- 0.02 | 1198.69 +/- 269.98 | 2h25m |
| XGBoost Regressor | 1028.61 +/- 120.34 | 0.14 +/- 0.01 | 1473.61 +/- 211.25 | 1h39s |
| Linear Regression | 1937.11 +/- 79.38 | 0.29 +/- 0.02 | 2745.97 +/- 154.27 | 13.1s |
| Lasso | 1978.51 +/- 97.02 | 0.28 +/- 0.01 | 2849.0 +/- 200.1 | 29.3s |

Model Selection Conclusions: since XGBoost was the second-best model in terms of errors and took much less time to run, we decided to finish this cycle with it. Another reason is that, after tuning its parameters, the Random Forest Regressor could take even longer to run. Since time is a cost in a business context, we have to consider it when making these decisions. In the second CRISP cycle we can try another model and improve performance.

Step 8: Hyperparameter Fine Tuning

Here we wanted to find the set of parameters that maximizes the algorithm's learning. We did this by applying the random search method; we chose it because it samples parameter combinations randomly and is therefore much faster. The best parameters found were as follows:

```python
param_tuned = {
    'n_estimators': 3000,
    'eta': 0.03,
    'max_depth': 5,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3
}
```
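For reference, a rough sketch of such a random search over XGBoost, reusing the cross-validation helper sketched earlier; the candidate values in the grid are illustrative:

```python
import random
import xgboost as xgb

param_grid = {
    'n_estimators': [1500, 2000, 2500, 3000, 3500],
    'eta': [0.01, 0.03],
    'max_depth': [3, 5, 9],
    'subsample': [0.1, 0.5, 0.7],
    'colsample_bytree': [0.3, 0.7, 0.9],
    'min_child_weight': [3, 8, 15],
}

MAX_EVAL = 10
results = []
for _ in range(MAX_EVAL):
    # draw one random combination of hyperparameters
    hp = {k: random.sample(v, 1)[0] for k, v in param_grid.items()}

    # evaluate it with the time series cross-validation sketched in Step 7
    model = xgb.XGBRegressor(objective='reg:squarederror', **hp)
    results.append((hp, time_series_cv(x_training, 5, model)))

# afterwards, pick the combination with the lowest cross-validated error
```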

With these parameters, the errors decreased markedly.

| Model Name | MAE CV | MAPE CV | RMSE CV |
|---|---|---|---|
| XGBoost Regressor | 644.21 | 0.10 | 933.16 |

MAPE improved by 4 percentage points, from 14% to 10%.

back to contents


Phase 5: Evaluation

Here we evaluated the model results with appropriate metrics. Besides that, we translated those metrics into business terms.

Step 9: Translating and Interpreting the Model Error

This step is about looking at the error and translating it to a business language.

"What's the impact to the business? the model is usefull or I still have to improve it more?". These are examples of the questions we wanted to answer in this phase.

This step is divided into two:

I. Business Performance

The total predicted sales for the next six weeks gives us the business performance for each store. Best and worst scenarios were created by adding the Mean Absolute Error (MAE) to, and subtracting it from, the predictions.

These scenarios help the manager make better decisions about the investment in each store, considering the best and the worst cases.
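A possible sketch of how these per-store scenarios are built, assuming df9 holds the store id, the real sales, and the back-transformed predictions for the validation window:

```python
# per-store absolute error, then aggregate into totals and MAE
df9['abs_error'] = (df9['sales'] - df9['predictions']).abs()

df_perf = df9.groupby('store').agg(
    predictions=('predictions', 'sum'),
    MAE=('abs_error', 'mean')).reset_index()

# best/worst scenarios: total prediction plus/minus the store's MAE
df_perf['worst_scenario'] = df_perf['predictions'] - df_perf['MAE']
df_perf['best_scenario'] = df_perf['predictions'] + df_perf['MAE']
```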

| Scenario | Values |
|---|---|
| worst_scenario | R$ 286,006,481.05 |
| predictions | R$ 286,728,640.00 |
| best_scenario | R$ 287,450,811.39 |

Below are the eight stores with the highest Mean Absolute Percentage Error (MAPE).

| store | predictions | worst_scenario | best_scenario | MAE |
|---|---|---|---|---|
| 292 | 105383.86 | 102061.26 | 108706.46 | 3322.60 |
| 909 | 237669.67 | 230022.56 | 245316.78 | 7647.11 |
| 595 | 344569.97 | 339579.62 | 349560.32 | 4990.35 |
| 876 | 207206.31 | 203287.59 | 211125.03 | 3918.72 |
| 722 | 357292.44 | 355184.43 | 359400.45 | 2108.01 |
| 718 | 201979.52 | 200086.88 | 203872.15 | 1892.64 |
| 274 | 193574.20 | 192156.15 | 194992.26 | 1418.05 |
| 782 | 221717.41 | 220967.12 | 222467.70 | 750.29 |

We can see that there are stores whose MAPE is greater than 50%, which means their predictions are off by more than 50%. Let's look at a scatter plot of MAPE.

The majority of the Mean Absolute Percentage Errors lie between 5% and 20%. Since this is a fictional project, we can't talk to the business team and get their approval of the predictions. So, let's pretend they approved and keep going.

II. Machine Learning Performance

This is the last analysis before model deployment. Here the overall model performance was analyzed through five charts. Starting with the fit of the model, the chart below shows that the predictions seem to fit the real sales well.

The error rate (the ratio between predicted and observed values) is presented in the following chart. We can see that it varies around 0.15, which can be considered low for this first cycle. We'll try to reduce it in the next CRISP cycle.

It is important to analyze the behavior of the residuals when dealing with regression. One of the most important premises of a good model is that the residuals have a normal-shaped distribution with zero mean and constant variance. The following chart shows that the residuals seem to be normal.

This is another chart that helps us analyze the residuals. The expected shape is the residuals concentrated within a 'tube'. Since we can't see any trend in the residuals, there doesn't seem to be any heteroscedasticity.

The last task in this step is to check the fit of the residuals to the normal distribution. As shown below, it's not a perfect fit, but it's good enough to continue this cycle; we can improve it later.

back to contents


Phase 6: Deployment

Creating and evaluating the model is usually not the end of the project. The results have to be delivered or presented to the stakeholders, and that's what this phase is about.

Step 10: Deploying the Machine Learning Model to Production: a Telegram Bot

We decided to present the results on the stakeholder's smartphone. To do that, we deployed the model in a cloud server and we created a Telegram bot to present the results.

We deployed the model, the scalers, and the transformations on Heroku, a platform "that enables developers to build, run, and operate applications entirely in the cloud" (see website).

After testing the application and the requests locally, the bot was created. The production structure is as follows:

How it works (a minimal sketch of the prediction endpoint follows the list):

  1. the user texts a store number to the Telegram bot;
  2. the Rossmann API (rossmann-bot.py) receives the request and retrieves the data for that store from the test dataset;
  3. the Rossmann API sends the data to the Handler API (handler.py);
  4. the Handler API applies the data preparation to shape the raw data and generates predictions using the model (model_rossman.pkl);
  5. the Handler returns the predictions to the Rossmann API; and
  6. the Rossmann API returns the sales prediction to the user on Telegram.
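A minimal, hypothetical sketch of what the prediction endpoint in handler.py could look like, assuming Flask and the pickled model; the preprocessing pipeline is collapsed into a placeholder function here:

```python
import pickle
import numpy as np
import pandas as pd
from flask import Flask, request, Response

model = pickle.load(open('model_rossmann.pkl', 'rb'))
app = Flask(__name__)

def prepare(df_raw: pd.DataFrame) -> pd.DataFrame:
    """Placeholder: apply the same cleaning, feature engineering,
    and scaling used during training."""
    return df_raw

@app.route('/rossmann/predict', methods=['POST'])
def rossmann_predict():
    test_json = request.get_json()
    if not test_json:
        return Response('{}', status=200, mimetype='application/json')

    # rebuild a dataframe from the incoming JSON payload
    df_raw = pd.DataFrame(test_json) if isinstance(test_json, list) \
        else pd.DataFrame(test_json, index=[0])

    x = prepare(df_raw)
    pred = np.expm1(model.predict(x))      # back-transform from log scale

    df_raw['prediction'] = pred
    return df_raw.to_json(orient='records', date_format='iso')

if __name__ == '__main__':
    app.run('0.0.0.0', port=5000)
```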

The following gif shows the bot receiving requests and sending back the predictions. The bot is configured to return 'Store Not Available' if the specified store is not in the test dataset and 'Wrong ID' if the user types something other than a number.

(gif: Telegram bot receiving requests and replying with predictions)

Bot Improvements in the next CRISP-DM cycle

Our team will revisit the delivery method in the next cycle; maybe we could use another one. However, there is some additional information that could be added to the bot, such as:

  • a welcome message;
  • the best and the worst scenarios for the stores;
  • the total prediction (also the total best and worst scenarios);
  • display a chart;
  • requests for more than one store;
  • display a 'wait' message while the request is made.

back to contents


Conclusion

In this project we built an end-to-end sales prediction solution, going from the initial business understanding to the deployment of the product to the stakeholder as a Telegram bot, following CRISP-DM cycles. Besides the business knowledge gained in the Exploratory Data Analysis, we built a model with the XGBoost algorithm to predict the next six weeks of sales for Rossmann stores.

After two months and one day (including this ReadMe creation), I finished the first cycle of this project and I want to highlight two main lessons I learned:

  1. The construction of an end-to-end Data Science solution is challenging, both in terms of business understanding and Machine Learning techniques;
  2. We need more than Python and statistics to really create business value with Data Science: we need to know how to solve a business problem, understand its demand, predict the challenges we'll face during the journey and mitigate them, and develop a solution that the stakeholders can actually use. In summary, we need to keep a business perspective when dealing with these projects.

That's all thanks to Meigarom's course.

"Do… or do not. There is no try." - Master Yoda
