
Bag Hunter

Sabrina Yang

Abstract

The goal of this project is to combine data engineering and deep learning to classify images of luxury bags by brand, and to deploy the model as a web application for end users.

Given an image of a bag, the app identifies its brand, which can benefit e-commerce companies such as online retailers and reseller businesses.

Design

  1. API web scraping + MongoDB
  2. Preprocessing
  3. Deep Learning
  4. Pipeline processing framework to automatically refresh the database and retrain the model
  5. Deploy the processing framework as an app service

Data

The dataset was downloaded from FARFETCH by scraping its API. One reason I chose Farfetch is that its images are high resolution with clean cutouts and model shots, which makes it a good resource for building an image classification model.

The full scrape contains over 400,000 items stored in MongoDB. My project focuses on the bags section, which holds 21,516 items across 669 different brands. I used a MongoDB query to rank brands by item count and picked 5 from the top of the list – the final dataset is 1,670 images across 5 brands: YSL, Prada, Gucci, Hermes, and LV.
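As a sketch of that brand-ranking step, the aggregation below counts items per brand in the bags collection; the database/collection names and the `brand.name` field are assumptions about the scraped schema:

```python
from pymongo import MongoClient

# Hypothetical database and collection names for the scraped data.
bags = MongoClient("mongodb://localhost:27017")["farfetch"]["bags"]

# Count items per brand and keep the most common ones; "brand.name"
# is an assumed field in the Farfetch product JSON.
for doc in bags.aggregate([
    {"$group": {"_id": "$brand.name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]):
    print(doc["_id"], doc["count"])
```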

EDA image

Methodology

0. API web scraping + MongoDB

  • Data ingestion:
    use a Python wrapper around the Farfetch.com API to pull JSON files that can be read directly into a Mongo database for data acquisition, cleaning, and EDA (see the sketch after this list).

    • a. Ingest new data: the code runs on a schedule and properly updates the database.
    • b. Database quality control: run ingestion bi-weekly and only on the first page, since newly listed items appear on Page 1 and turnover is not heavy – this balances run time against database quality.
  • Data storage: MongoDB is a NoSQL database, which makes it well suited to loading a series of JSON files of product info directly into a MongoDB collection.
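A minimal sketch of the ingestion step described above, assuming a hypothetical endpoint URL and that each response page carries a `products` list with an `id` field (the real API wrapper and schema may differ):

```python
import requests
from pymongo import MongoClient

API_URL = "https://api.example.com/farfetch/bags"  # hypothetical endpoint
collection = MongoClient("mongodb://localhost:27017")["farfetch"]["bags"]

def ingest_first_page():
    """Pull Page 1 of the listing and upsert each product into MongoDB."""
    page = requests.get(API_URL, params={"page": 1}, timeout=30).json()
    for product in page.get("products", []):
        # Upsert by product id so the bi-weekly run never duplicates items.
        collection.replace_one({"id": product["id"]}, product, upsert=True)

if __name__ == "__main__":
    ingest_first_page()
```

Upserting by id is what makes the first-page-only, bi-weekly schedule safe to re-run.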

1. Preprocessing

  • Data directories setup: create class folders from the notebook and download the product images directly via the image URLs stored in MongoDB.

  • Image preprocessing: use keras.preprocessing and ImageDataGenerator to load and rescale the images (see the sketch below).
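A minimal sketch of the generator setup, assuming images were downloaded into one folder per brand under `data/train/` and `data/val/` (the paths and 224×224 input size are assumptions):

```python
from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1]; one subfolder per brand is assumed,
# e.g. data/train/gucci/, data/train/prada/, ...
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="categorical"
)
val_gen = datagen.flow_from_directory(
    "data/val", target_size=(224, 224), batch_size=32, class_mode="categorical"
)
```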

2. Deep Learning

  • Transfer learning (VGG16): batch size = 32, 3 added dense layers, epochs = 20; also tried epochs = 100 with callbacks.EarlyStopping and callbacks.ReduceLROnPlateau (a sketch of this setup follows the list).

  • Image augmentation: since the dataset is small, I used augmentation (rotation = 40, horizontal flip) to increase the effective dataset size; however, the test accuracy came out higher than the training accuracy, which signaled an unreliable fit, so I did not continue training with this approach.

  • Multi-class classification: I tried increasing the number of classes to 3, 5, 6, and 8 different brands, and the accuracy scores were around 70–80%; since the dataset is small and image augmentation had the fit issue above in my case, I chose the 5-brand model for the final deployment.
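A minimal sketch of the VGG16 transfer-learning setup, reusing the generators from the preprocessing sketch; the dense-layer widths and optimizer are assumptions, since the writeup only states that 3 dense layers were added:

```python
from keras.applications import VGG16
from keras import callbacks, layers, models

# Frozen VGG16 convolutional base pretrained on ImageNet.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Three added dense layers; the widths here are illustrative guesses.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),  # 5 brands
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# The longer run: up to 100 epochs with early stopping and LR reduction.
model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=100,
    callbacks=[
        callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        callbacks.ReduceLROnPlateau(patience=3),
    ],
)
```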

3. Pipeline processing framework to auto renew database and retrain model

  • Develop reusable Python code for the whole process and save the model for reuse/retraining.
  • Set a cron job that calls the API to refresh the MongoDB data bi-weekly.
  • As above, after roughly 3 months of newly collected data, the pipeline connects to the database and retrains the deep learning model, renewing it quarterly/seasonally (see the sketch after this list).
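A minimal sketch of the retraining entry point; the crontab lines in the comment and the script/model paths are illustrative assumptions:

```python
# Illustrative crontab entries (paths and schedules are assumptions):
#   0 2 */14 * *  python /opt/bag-hunter/ingest.py    # bi-weekly data refresh
#   0 3 1 */3 *   python /opt/bag-hunter/retrain.py   # quarterly retrain
from keras.models import load_model
from keras.preprocessing.image import ImageDataGenerator

MODEL_PATH = "models/bag_hunter_vgg16.h5"  # hypothetical saved-model path

def retrain():
    """Reload the saved model, rebuild the generator from fresh data, retrain, resave."""
    model = load_model(MODEL_PATH)
    datagen = ImageDataGenerator(rescale=1.0 / 255)
    train_gen = datagen.flow_from_directory(
        "data/train", target_size=(224, 224), batch_size=32,
        class_mode="categorical",
    )
    model.fit(train_gen, epochs=20)
    model.save(MODEL_PATH)

if __name__ == "__main__":
    retrain()
```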

4. Deploy the processing framework as an app service

  • Serve the model through a Streamlit app so end users can identify a bag's brand by uploading an image (supported formats: jpg/png/jpeg); a minimal sketch follows.
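A minimal sketch of the Streamlit app, assuming the model path and class ordering from the earlier sketches (both are assumptions):

```python
import numpy as np
import streamlit as st
from PIL import Image
from keras.models import load_model

BRANDS = ["Gucci", "Hermes", "LV", "Prada", "YSL"]  # assumed class order

st.title("Bag Hunter")
uploaded = st.file_uploader("Upload a bag image", type=["jpg", "png", "jpeg"])

if uploaded is not None:
    st.image(uploaded)
    # Match the training preprocessing: 224x224 RGB, rescaled to [0, 1].
    img = Image.open(uploaded).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype="float32")[np.newaxis] / 255.0
    model = load_model("models/bag_hunter_vgg16.h5")  # hypothetical path
    pred = model.predict(x)
    st.write(f"Predicted brand: {BRANDS[int(np.argmax(pred))]}")
    st.balloons()  # the cheerful balloons animation from the screenshot
```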

Workflow

Tools

  1. NumPy & Pandas: data manipulation
  2. MongoDB: data storage
  3. Keras: transfer learning and image preprocessing
  4. Streamlit: web app deployment

Results

1. Final model accuracy score: 0.7838

a. accuracy & loss chart

b. confusion matrix

c. classification report

2. Streamlit app screenshot

Upload a bag image and the app shows the correct prediction along with a cheerful balloons animation.