Skip to content

Latest commit

 

History

History
420 lines (268 loc) · 16.5 KB

README.md

File metadata and controls

420 lines (268 loc) · 16.5 KB

Driven by Data

Joh Akaishi

This repository contains notebooks for my machine learning project, “Driven by Data”.
Notebooks can be found below:

Table of Contents


1. Problem Statement
2. Project Flow
3. Web Scraping, Feature Extraction and Cleaning
4. Exploratory Data Analysis
5. Modelling
6. Conclusions and Future Expansion

1. Problem Statement


“What is my car worth?”

There are many factors (or “features”…) that affect the value of a car. This project aims to dive deep (or should I say "drive" deep?) into this very question, and to use a data-driven approach to accurately predict the listed price of cars.

Autotrader is one of the largest automotive classified advertising sites in the UK, listing both new and second hand cars sold by private sellers and trade dealers. I extracted the data for this project by scraping through the Autotrader website, resulting in approximately 15,000 unique (non-duplicated) observations, on which to perform exploratory analysis and predict prices using regression models.

(back to top)

2. Project Flow


(back to top)

3. Web Scraping, Feature Extraction and Cleaning


Link to notebook

Web Scraping

The data for this project was scraped from the Autotrader website using Beautiful Soup, Soup Strainer and Cloudscraper.

Scraped fields:

Scraped Fields Description
scr_price Listed car price
scr_header Header of the advert
scr_attngrab "Attention grabbing" one liner!
scr_descr Description of the vehicle
(incl. year, bodytype, mileage, engine size, horsepower, transmission, fuel type, ULEZ compliance, number of previous owners)
scr_sellerrating Rating of the seller (5-star scale)
scr_sellertown Location of the seller
scr_sellertype Type of the seller (either private or trade seller)
scr_url URL of the unique advert



The scraped dataframe resulted in 62,082 observations, across 8 columns:

Dropping duplicates and null values

Below is a visual representation of null values, using the missingno package.
White space represents null values in the dataset.

  • As the data was gathered through web scraping, there were duplicate rows within the dataset.
  • Firstly, these duplicates were dropped, and then rows containing null values were also dropped.
# original df_master
print('before dropping duplicates:', df_master.shape)

# drop duplicates
df = df_master.drop_duplicates(ignore_index=True)
print('after dropping duplicates:', df.shape)
#df.head()

# drop null values
df = df.dropna().reset_index()
df.drop(columns='index', inplace=True)

print('after dropping null values:', df.shape)
display(df.head())

Feature Extraction and Cleaning

The following features were extracted from the scr_descr column, using regex:

Feature Description Extracted from
make Make (manufacturer) of the vehicle 'scr_header'
model Model of the vehicle 'scr_header'
year Year of manufacture 'scr_descr'
reg_num Registration number (related to year) 'scr_descr'
body Body type of vehicle ('hatchback', 'estate', 'SUV', etc) 'scr_descr'
mileage Mileage on the vehicle odometer 'scr_descr'
engine_size Size of the engine (litres) 'scr_descr'
horsepower Breaking horsepower 'scr_descr'
transmission Transmission type (AT or MT) 'scr_descr'
fuel Fuel type of the vehicle (petrol or diesel) 'scr_descr'
owners Number of previous owners of the vehicle 'scr_descr'
ulez Whether the vehicle is ULEZ compliant or not (Ultra Low Emission Zone; a charge for polluting vehicles in central London) 'scr_descr'
seller_type Type of seller ('trade seller' or 'approved dealer') 'scr_sellertype'
price Listed price of the vehicle advert in integer type 'scr_price'

Example code to extract features:

# extract info from description column, to generate new columns
def year(x):
    pattern = re.compile(r'(\d{4})')
    element = re.findall(pattern, x)
    try:
        return int(element[0])
    except:
        return np.nan

def reg_num(x):
    pattern = re.compile(r'(\d\d reg)')
    element = re.findall(pattern, x)
    try:
        return int(element[0].split()[0])
    except:
        return np.nan

df['year'] = df.scr_descr.apply(lambda x: year(x))
df['reg_num'] = df.scr_descr.apply(lambda x: reg_num(x))

There were some entries where the engine size was picked up for the 'body' column:

Corrected as below:

# correct body style from engine size (eg 1.5l, 2.2l) to actual body style
def body_style_corrected(x):
    list_x = ast.literal_eval(x)
    return list_x[0].lower()

df.loc[df.body.str.contains('(\d.\dl)'), 'body'] = df.scr_descr.apply(lambda x: body_style_corrected(x))

# df = df where 'body' does not contain digits (ie delete all rows with digits in body column)
df = df[~df.body.str.contains('(\d)')]

# reset index inplace
df.reset_index(drop=True, inplace=True)

During extraction of 'make', it was found that those observations with multi-word manufacturer names only registered the first word.

  • 'land' instead of 'land rover'
  • 'aston' instead of 'aston martin'
  • 'alfa' instead of 'alfa romeo'
  • etc.

This was corrected using index replacement, as below:

# correct make names for those with more than one word as the make name
double_car_names = ['land', 'aston', 'alfa', 'great']

indexer = df[df.make == 'land'].index
df.iloc[indexer, -2] = 'landrover'

indexer = df[df.make == 'aston'].index
df.iloc[indexer, -2] = 'astonmartin'

indexer = df[df.make == 'alfa'].index
df.iloc[indexer, -2] = 'alfaromeo'

indexer = df[df.make == 'great'].index
df.iloc[indexer, -2] = 'greatwall'

The resulting dataframe null values after the cleaning process:

  • Many of the 'owners' column had missing values, so this column was dropped (although there is a risk that the number of previous owners does have some correlation with the target: price).
  • After dropping 'owners', all remaining rows with null values were dropped.

  • The final dataframe with 14,996 observations (all unique):

(back to top)

4. Exploratory Data Analysis


Link to notebook

Distribution of Price

  • As expected (for car prices), there is a heavy right skew showing the presence of a few outliers that are very expensive, some even reaching GBP 600,000!

  • Taking the logarithm of the price, results in an approximately normal distribution.

Correlation Heatmap and Scatterplots

  • Perhaps expectedly, with respect to price:
    • strong positive correlation with engine_size and horsepower.
    • negative correlation with mileage (cars with higher mileage on the odometer will sell for less).

  • Larger variance in price, when engine size increases, but smaller engine sizes will almost certainly result in a lower price.
  • "Vertical lines" can be seen in the plot, also as engine sizes are stated down to one decimal place (eg 2.8L engine).

  • As horsepower (BHP) increases, the price also increases.
  • It seems there is an almost exponential relationship, which could be addressed using polynomial features, or a power of the horsepower to achieve a more linear relationship with price.

  • As mileage increases, there is a depreciation in price (resembling almost an exponential decay).

  • Also understandably, there is a correlation between BHP and engine_size.

  • Registration numbers correspond with the year, as can be seen in the plot above.
  • For this reason, reg_num was dropped from the dataframe, to avoid issues with strong multicollinearity.

(back to top)

5. Modelling


Link to notebook: Linear regression with Ridge and Lasso regularisation

Link to notebook: Tree based regressors (Decision Tree, Random Forest, Extra Trees)

Link to notebook: Boosting regressors (AdaBoost, GradientBoost, HistGradientBoost, LightGBM)

Link to 'myfunctions.py' Functions have been defined within myfunctions.py, which are called upon in the notebooks above.

Preprocessing

Train Test Split

  • Dataframe split to have 20% set aside as the test set.

Pipeline

  • A "pipeline construction" function was defined with the following transformers:
    • Standardisation for continuous variables
    • One Hot Encoding for categorical variables
def pipe_construct(features_cont=features_cont, features_cat=features_cat, model=LinearRegression()):
    '''
    Input: continuous features, categorical features, model of choice. 
    Returns: pipeline.
    '''
    t = [('cont', StandardScaler(), features_cont), 
         ('cat', OneHotEncoder(handle_unknown='ignore', sparse=False), features_cat)]

    transformer = ColumnTransformer(transformers=t, remainder='drop')

    pipe = Pipeline(steps=[('transformer', transformer), ('model', model)])
    return pipe

  • The pipe was then called upon in the following way (input features were selected in different combinations).
# Define input features 
features_cont = ['year', 'mileage', 'engine_size', 'horsepower', 'bhp_per_litre']
features_cat = [i for i in df.columns if i not in features_cont]
features_cat.remove('price') # remove target variable

# fit pipe
pipe = pipe_construct(features_cont, features_cat, LinearRegression())
pipe.fit(X_train, y_train)

Ridge Regression

The (5-fold) cross validated mean score (hereafter "CV mean") was used to assess model performance and generalisability to new, unknown data.

Ridge Regression ( = 0.3594) resulted in the highest initial CV mean of 0.9219

  • The plots above show the actual against predicted values for the train and test set.

  • There's already a relatively good fit, but it can be seen there are some negatively predicted values, as a result of linear relationships being inferred with the features.

  • Distribution of residuals show an approximately normal distribution, albeit with a few extreme residuals (approx. -300,000 and +150,000).

  • Aside from this, the standardised residuals also show a relatively even pattern, with the exception of a few residuals at very high predicted prices (>400,000) which deviate quite far from zero.

  • Homoscedasticity therefore not quite attained.

Extra Trees Regressor

After fitting various ensemble models, the Extra Trees Regressor resulted in the highest CV mean of 0.9586

  • The RMSE and MAE were also computed.

  • RMSE showed relatively high values, perhaps owing to the presence of outliers (very expensive cars) within the dataset.

  • MAE of the train set and test set were 22.69 and 1726.36 (in GBP) respectively.

  • As seen from the plot above, the model shows signs of heavy overfitting on the training set.
  • However as the CV mean was also scoring highly, it was the model of choice for generalisability.

  • The plots of residuals above, shows again that they follow an approximately normal distribution with the exception of a few extreme outliers.

  • The standardised residuals plot shows some deviation of predictions when the predicted price was below GBP 100,000 and some large deviations at the GBP 300,000 mark (train set).

  • The standardised residuals for the test set show relatively tight deviations at predicted prices below GBP 50,000 (test set).

  • The feature importances above show relative feature importances, with the sum of all features adding to 1.0

  • 'Horsepower' is of highest importance, closely followed by 'engine_size'.

    • In the future, another feature could be computed from horsepower and engine size (eg 'BHP per litre'), and used instead, as these two features are correlated with each other.
  • It can also be seen that the make of the car (eg 'Ferrari', 'Bentley', and 'Rolls-Royce') are of relatively high importance.

    • 'Martin' here is suspected to be mistaken as the the model name, from the make 'Aston Martin'.

(back to top)

6. Conclusions


Highest scoring models:

Model CV mean (5 fold) MAE (train) MAE (test) RMSE (train) RMSE (test)
Extra Trees Regressor 0.9586 22.69 1726.36 1094.07 5286.36
Random Forest Regressor 0.9441 745.38 2048.36 2637.77 7179.35

  • RMSE showed relatively high values, perhaps owing to the presence of outliers (very expensive cars) within the dataset.

  • Extra Trees Regressor: MAE of the train set and test set were 22.69 and 1726.36 (in GBP) respectively.

Features:

  • Horsepower seems to be the most important feature for predicting the price of a vehicle, closely followed by other features such as 'engine_size', 'mileage' and 'year'.

  • The make of the car, understandably, also ranks highly for predictive power on sales price.

Limitations:

  • This project is somewhat limited due to the fact that the number of observations (~15,000) was relatively small compared to the 400,000 listed cars on Autotrader.
    • This was a limitation borne from the web scraping process.

Future Expansion:

  • Further feature engineering such as implementing polynomial features may assist to improve model accuracy.
    • This is suggested from the trends found within the initial EDA.
    • Also implementing cross features in place of two features (such as 'BHP per litre', to replace 'horsepower' and 'engine_size') may simplify the model.

(back to top)