Driven by Data

Joh Akaishi

This repository contains notebooks for my machine learning project, “Driven by Data”.
Notebooks can be found below:

Web Scraping
EDA
Modelling: linear regression (Including regularised models.)
Modelling: tree based regressors (Decision Tree, Random Forest, Extra Trees Regressor.)
Modelling: boosting regressors (AdaBoost, GradientBoost, HistGradientBoost, LightGBM.)
Functions script (Functions have been defined within myfunctions.py, which are called upon in the notebooks above.)

1. Problem Statement

“What is my car worth?”

There are many factors (or “features”…) that affect the value of a car. This project aims to dive deep (or should I say "drive" deep?) into this very question, and to use a data-driven approach to accurately predict the listed price of cars.

Autotrader is one of the largest automotive classified advertising sites in the UK, listing both new and second hand cars sold by private sellers and trade dealers. I extracted the data for this project by scraping through the Autotrader website, resulting in approximately 15,000 unique (non-duplicated) observations, on which to perform exploratory analysis and predict prices using regression models.

(back to top)

2. Project Flow

(back to top)

3. Web Scraping, Feature Extraction and Cleaning

Link to notebook

Web Scraping

The data for this project was scraped from the Autotrader website using Beautiful Soup, Soup Strainer and Cloudscraper.

Scraped fields:

Scraped Fields	Description
scr_price	Listed car price
scr_header	Header of the advert
scr_attngrab	"Attention grabbing" one liner!
scr_descr	Description of the vehicle (incl. year, bodytype, mileage, engine size, horsepower, transmission, fuel type, ULEZ compliance, number of previous owners)
scr_sellerrating	Rating of the seller (5-star scale)
scr_sellertown	Location of the seller
scr_sellertype	Type of the seller (either private or trade seller)
scr_url	URL of the unique advert

The scraped dataframe resulted in 62,082 observations, across 8 columns:

Dropping duplicates and null values

Below is a visual representation of null values, using the missingno package.
White space represents null values in the dataset.

As the data was gathered through web scraping, there were duplicate rows within the dataset.
Firstly, these duplicates were dropped, and then rows containing null values were also dropped.

# original df_master
print('before dropping duplicates:', df_master.shape)

# drop duplicates
df = df_master.drop_duplicates(ignore_index=True)
print('after dropping duplicates:', df.shape)
#df.head()

# drop null values
df = df.dropna().reset_index()
df.drop(columns='index', inplace=True)

print('after dropping null values:', df.shape)
display(df.head())

Feature Extraction and Cleaning

The following features were extracted from the scr_descr column, using regex:

Feature	Description	Extracted from
make	Make (manufacturer) of the vehicle	'scr_header'
model	Model of the vehicle	'scr_header'
year	Year of manufacture	'scr_descr'
reg_num	Registration number (related to year)	'scr_descr'
body	Body type of vehicle ('hatchback', 'estate', 'SUV', etc)	'scr_descr'
mileage	Mileage on the vehicle odometer	'scr_descr'
engine_size	Size of the engine (litres)	'scr_descr'
horsepower	Breaking horsepower	'scr_descr'
transmission	Transmission type (AT or MT)	'scr_descr'
fuel	Fuel type of the vehicle (petrol or diesel)	'scr_descr'
owners	Number of previous owners of the vehicle	'scr_descr'
ulez	Whether the vehicle is ULEZ compliant or not (Ultra Low Emission Zone; a charge for polluting vehicles in central London)	'scr_descr'
seller_type	Type of seller ('trade seller' or 'approved dealer')	'scr_sellertype'
price	Listed price of the vehicle advert in integer type	'scr_price'

Example code to extract features:

# extract info from description column, to generate new columns
def year(x):
    pattern = re.compile(r'(\d{4})')
    element = re.findall(pattern, x)
    try:
        return int(element[0])
    except:
        return np.nan

def reg_num(x):
    pattern = re.compile(r'(\d\d reg)')
    element = re.findall(pattern, x)
    try:
        return int(element[0].split()[0])
    except:
        return np.nan

df['year'] = df.scr_descr.apply(lambda x: year(x))
df['reg_num'] = df.scr_descr.apply(lambda x: reg_num(x))

There were some entries where the engine size was picked up for the 'body' column:

Corrected as below:

# correct body style from engine size (eg 1.5l, 2.2l) to actual body style
def body_style_corrected(x):
    list_x = ast.literal_eval(x)
    return list_x[0].lower()

df.loc[df.body.str.contains('(\d.\dl)'), 'body'] = df.scr_descr.apply(lambda x: body_style_corrected(x))

# df = df where 'body' does not contain digits (ie delete all rows with digits in body column)
df = df[~df.body.str.contains('(\d)')]

# reset index inplace
df.reset_index(drop=True, inplace=True)

During extraction of 'make', it was found that those observations with multi-word manufacturer names only registered the first word.

'land' instead of 'land rover'
'aston' instead of 'aston martin'
'alfa' instead of 'alfa romeo'
etc.

This was corrected using index replacement, as below:

# correct make names for those with more than one word as the make name
double_car_names = ['land', 'aston', 'alfa', 'great']

indexer = df[df.make == 'land'].index
df.iloc[indexer, -2] = 'landrover'

indexer = df[df.make == 'aston'].index
df.iloc[indexer, -2] = 'astonmartin'

indexer = df[df.make == 'alfa'].index
df.iloc[indexer, -2] = 'alfaromeo'

indexer = df[df.make == 'great'].index
df.iloc[indexer, -2] = 'greatwall'

The resulting dataframe null values after the cleaning process:

Many of the 'owners' column had missing values, so this column was dropped (although there is a risk that the number of previous owners does have some correlation with the target: price).
After dropping 'owners', all remaining rows with null values were dropped.
The final dataframe with 14,996 observations (all unique):

(back to top)

4. Exploratory Data Analysis

Link to notebook

Distribution of Price

As expected (for car prices), there is a heavy right skew showing the presence of a few outliers that are very expensive, some even reaching GBP 600,000!

Taking the logarithm of the price, results in an approximately normal distribution.

Correlation Heatmap and Scatterplots

Perhaps expectedly, with respect to price:
- strong positive correlation with engine_size and horsepower.
- negative correlation with mileage (cars with higher mileage on the odometer will sell for less).

Larger variance in price, when engine size increases, but smaller engine sizes will almost certainly result in a lower price.
"Vertical lines" can be seen in the plot, also as engine sizes are stated down to one decimal place (eg 2.8L engine).

As horsepower (BHP) increases, the price also increases.
It seems there is an almost exponential relationship, which could be addressed using polynomial features, or a power of the horsepower to achieve a more linear relationship with price.

As mileage increases, there is a depreciation in price (resembling almost an exponential decay).

Also understandably, there is a correlation between BHP and engine_size.

Registration numbers correspond with the year, as can be seen in the plot above.
For this reason, reg_num was dropped from the dataframe, to avoid issues with strong multicollinearity.

(back to top)

5. Modelling

Link to notebook: Linear regression with Ridge and Lasso regularisation

Link to notebook: Tree based regressors (Decision Tree, Random Forest, Extra Trees)

Link to notebook: Boosting regressors (AdaBoost, GradientBoost, HistGradientBoost, LightGBM)

Link to 'myfunctions.py' Functions have been defined within myfunctions.py, which are called upon in the notebooks above.

Preprocessing

Train Test Split

Dataframe split to have 20% set aside as the test set.

Pipeline

A "pipeline construction" function was defined with the following transformers:
- Standardisation for continuous variables
- One Hot Encoding for categorical variables

def pipe_construct(features_cont=features_cont, features_cat=features_cat, model=LinearRegression()):
    '''
    Input: continuous features, categorical features, model of choice. 
    Returns: pipeline.
    '''
    t = [('cont', StandardScaler(), features_cont), 
         ('cat', OneHotEncoder(handle_unknown='ignore', sparse=False), features_cat)]

    transformer = ColumnTransformer(transformers=t, remainder='drop')

    pipe = Pipeline(steps=[('transformer', transformer), ('model', model)])
    return pipe

The pipe was then called upon in the following way (input features were selected in different combinations).

# Define input features 
features_cont = ['year', 'mileage', 'engine_size', 'horsepower', 'bhp_per_litre']
features_cat = [i for i in df.columns if i not in features_cont]
features_cat.remove('price') # remove target variable

# fit pipe
pipe = pipe_construct(features_cont, features_cat, LinearRegression())
pipe.fit(X_train, y_train)

Ridge Regression

The (5-fold) cross validated mean $R^{2}$ score (hereafter "CV mean") was used to assess model performance and generalisability to new, unknown data.

Ridge Regression ( $\alpha$ = 0.3594) resulted in the highest initial CV mean of 0.9219

The plots above show the actual against predicted values for the train and test set.
There's already a relatively good fit, but it can be seen there are some negatively predicted values, as a result of linear relationships being inferred with the features.

Distribution of residuals show an approximately normal distribution, albeit with a few extreme residuals (approx. -300,000 and +150,000).
Aside from this, the standardised residuals also show a relatively even pattern, with the exception of a few residuals at very high predicted prices (>400,000) which deviate quite far from zero.
Homoscedasticity therefore not quite attained.

Extra Trees Regressor

After fitting various ensemble models, the Extra Trees Regressor resulted in the highest CV mean of 0.9586

The RMSE and MAE were also computed.
RMSE showed relatively high values, perhaps owing to the presence of outliers (very expensive cars) within the dataset.
MAE of the train set and test set were 22.69 and 1726.36 (in GBP) respectively.

As seen from the plot above, the model shows signs of heavy overfitting on the training set.
However as the CV mean was also scoring highly, it was the model of choice for generalisability.

The plots of residuals above, shows again that they follow an approximately normal distribution with the exception of a few extreme outliers.
The standardised residuals plot shows some deviation of predictions when the predicted price was below GBP 100,000 and some large deviations at the GBP 300,000 mark (train set).
The standardised residuals for the test set show relatively tight deviations at predicted prices below GBP 50,000 (test set).

The feature importances above show relative feature importances, with the sum of all features adding to 1.0
'Horsepower' is of highest importance, closely followed by 'engine_size'.
- In the future, another feature could be computed from horsepower and engine size (eg 'BHP per litre'), and used instead, as these two features are correlated with each other.
It can also be seen that the make of the car (eg 'Ferrari', 'Bentley', and 'Rolls-Royce') are of relatively high importance.
- 'Martin' here is suspected to be mistaken as the the model name, from the make 'Aston Martin'.

(back to top)

6. Conclusions

Highest scoring models:

Model	CV mean (5 fold)	MAE (train)	MAE (test)	RMSE (train)	RMSE (test)
Extra Trees Regressor	0.9586	22.69	1726.36	1094.07	5286.36
Random Forest Regressor	0.9441	745.38	2048.36	2637.77	7179.35

RMSE showed relatively high values, perhaps owing to the presence of outliers (very expensive cars) within the dataset.
Extra Trees Regressor: MAE of the train set and test set were 22.69 and 1726.36 (in GBP) respectively.

Features:

Horsepower seems to be the most important feature for predicting the price of a vehicle, closely followed by other features such as 'engine_size', 'mileage' and 'year'.
The make of the car, understandably, also ranks highly for predictive power on sales price.

Limitations:

This project is somewhat limited due to the fact that the number of observations (~15,000) was relatively small compared to the 400,000 listed cars on Autotrader.
- This was a limitation borne from the web scraping process.

Future Expansion:

Further feature engineering such as implementing polynomial features may assist to improve model accuracy.
- This is suggested from the trends found within the initial EDA.
- Also implementing cross features in place of two features (such as 'BHP per litre', to replace 'horsepower' and 'engine_size') may simplify the model.

(back to top)

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
readme_assets		readme_assets
Autotrader_1_scraping_cleaning.ipynb		Autotrader_1_scraping_cleaning.ipynb
Autotrader_2_eda.ipynb		Autotrader_2_eda.ipynb
Autotrader_3_linreg.ipynb		Autotrader_3_linreg.ipynb
Autotrader_4_dtr_rfr_etr.ipynb		Autotrader_4_dtr_rfr_etr.ipynb
Autotrader_5_adaboost_gradientboost_histgbr_lightgbm.ipynb		Autotrader_5_adaboost_gradientboost_histgbr_lightgbm.ipynb
Autotrader_ttr.ipynb		Autotrader_ttr.ipynb
Autotrader_ttr_with_values.ipynb		Autotrader_ttr_with_values.ipynb
README.md		README.md
myfunctions.py		myfunctions.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Driven by Data

Joh Akaishi

Table of Contents

1. Problem Statement

2. Project Flow

3. Web Scraping, Feature Extraction and Cleaning

Web Scraping

Dropping duplicates and null values

Feature Extraction and Cleaning

4. Exploratory Data Analysis

Distribution of Price

Correlation Heatmap and Scatterplots

5. Modelling

Preprocessing

Train Test Split

Pipeline

Ridge Regression

Extra Trees Regressor

6. Conclusions

Highest scoring models:

Features:

Limitations:

Future Expansion:

About

Releases

Packages

Languages

johakaishi/driven_by_data

Folders and files

Latest commit

History

Repository files navigation

Driven by Data

Joh Akaishi

Table of Contents

1. Problem Statement

2. Project Flow

3. Web Scraping, Feature Extraction and Cleaning

Web Scraping

Dropping duplicates and null values

Feature Extraction and Cleaning

4. Exploratory Data Analysis

Distribution of Price

Correlation Heatmap and Scatterplots

5. Modelling

Preprocessing

Train Test Split

Pipeline

Ridge Regression

Extra Trees Regressor

6. Conclusions

Highest scoring models:

Features:

Limitations:

Future Expansion:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages