
Bag Hunter

Sabrina Yang

Abstract

The goal of this project is to combine data engineering and deep learning to classify images of luxury bags by brand, and to deploy the model as a web application for end users.

Given an image of a bag, the app identifies its brand, which can benefit e-commerce companies such as online retailers and reseller businesses.

Design

  1. API web scraping + MongoDB
  2. Preprocessing
  3. Deep Learning
  4. Pipeline processing framework to automatically refresh the database and retrain the model
  5. Deploy the processing framework as an app service

Data

The dataset was downloaded from FARFETCH by scraping its API. One reason I chose Farfetch is that its images are high resolution with clean cutouts and model shots, which makes it a good resource for building an image classification model.

The full scrape contains over 400,000 items stored in MongoDB. My project focuses on the bags section, which holds 21,516 items across 669 different brands. I used a MongoDB query to rank brands by item count and picked 5 from the top of the list – the final dataset is 1,670 images across 5 brands: YSL, Prada, Gucci, Hermes, and LV.
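As a sketch of that brand-ranking step, the aggregation below counts items per brand in the bags collection; the database/collection names and the `brand.name` field are assumptions about the scraped schema:

```python
from pymongo import MongoClient

# Hypothetical database and collection names for the scraped data.
bags = MongoClient("mongodb://localhost:27017")["farfetch"]["bags"]

# Count items per brand and keep the most common ones; "brand.name"
# is an assumed field in the Farfetch product JSON.
for doc in bags.aggregate([
    {"$group": {"_id": "$brand.name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]):
    print(doc["_id"], doc["count"])
```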

EDA image

Methodology

0. API web scraping + MongoDB

  • Data ingestion:
    use a Python wrapper around the Farfetch.com API to pull JSON files that can be read directly into a Mongo database for data acquisition, cleaning, and EDA (see the sketch after this list).

    • a. Ingest new data: the code runs on a schedule and properly updates the database.
    • b. Database quality control: run ingestion bi-weekly and only on the first page, since newly listed items appear on Page 1 and turnover is not heavy – this balances run time against database quality.
  • Data storage: MongoDB is a NoSQL database, which makes it well suited to loading a series of JSON files of product info directly into a MongoDB collection.
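A minimal sketch of the ingestion step described above, assuming a hypothetical endpoint URL and that each response page carries a `products` list with an `id` field (the real API wrapper and schema may differ):

```python
import requests
from pymongo import MongoClient

API_URL = "https://api.example.com/farfetch/bags"  # hypothetical endpoint
collection = MongoClient("mongodb://localhost:27017")["farfetch"]["bags"]

def ingest_first_page():
    """Pull Page 1 of the listing and upsert each product into MongoDB."""
    page = requests.get(API_URL, params={"page": 1}, timeout=30).json()
    for product in page.get("products", []):
        # Upsert by product id so the bi-weekly run never duplicates items.
        collection.replace_one({"id": product["id"]}, product, upsert=True)

if __name__ == "__main__":
    ingest_first_page()
```

Upserting by id is what makes the first-page-only, bi-weekly schedule safe to re-run.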

1. Preprocessing

  • Data directories setup: create class folders from the notebook and download the product images directly via the image URLs stored in MongoDB.

  • Image preprocessing: use keras.preprocessing and ImageDataGenerator to load and rescale the images (see the sketch below).
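A minimal sketch of the generator setup, assuming images were downloaded into one folder per brand under `data/train/` and `data/val/` (the paths and 224×224 input size are assumptions):

```python
from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1]; one subfolder per brand is assumed,
# e.g. data/train/gucci/, data/train/prada/, ...
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="categorical"
)
val_gen = datagen.flow_from_directory(
    "data/val", target_size=(224, 224), batch_size=32, class_mode="categorical"
)
```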

2. Deep Learning

  • Transfer learning (VGG16): batch size = 32, 3 added dense layers, epochs = 20; also tried epochs = 100 with callbacks.EarlyStopping and callbacks.ReduceLROnPlateau (a sketch of this setup follows the list).

  • Image augmentation: since the dataset is small, I used augmentation (rotation = 40, horizontal flip) to increase the effective dataset size; however, the test accuracy came out higher than the training accuracy, which signaled an unreliable fit, so I did not continue training with this approach.

  • Multi-class classification: I tried increasing the number of classes to 3, 5, 6, and 8 different brands, and the accuracy scores were around 70–80%; since the dataset is small and image augmentation had the fit issue above in my case, I chose the 5-brand model for the final deployment.
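A minimal sketch of the VGG16 transfer-learning setup, reusing the generators from the preprocessing sketch; the dense-layer widths and optimizer are assumptions, since the writeup only states that 3 dense layers were added:

```python
from keras.applications import VGG16
from keras import callbacks, layers, models

# Frozen VGG16 convolutional base pretrained on ImageNet.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Three added dense layers; the widths here are illustrative guesses.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),  # 5 brands
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# The longer run: up to 100 epochs with early stopping and LR reduction.
model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=100,
    callbacks=[
        callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        callbacks.ReduceLROnPlateau(patience=3),
    ],
)
```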

3. Pipeline processing framework to auto renew database and retrain model

  • Develop reusable Python code for the whole process and save the model for reuse/retraining.
  • Set a cron job that calls the API to refresh the MongoDB data bi-weekly.
  • As above, after roughly 3 months of newly collected data, the pipeline connects to the database and retrains the deep learning model, renewing it quarterly/seasonally (see the sketch after this list).
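A minimal sketch of the retraining entry point; the crontab lines in the comment and the script/model paths are illustrative assumptions:

```python
# Illustrative crontab entries (paths and schedules are assumptions):
#   0 2 */14 * *  python /opt/bag-hunter/ingest.py    # bi-weekly data refresh
#   0 3 1 */3 *   python /opt/bag-hunter/retrain.py   # quarterly retrain
from keras.models import load_model
from keras.preprocessing.image import ImageDataGenerator

MODEL_PATH = "models/bag_hunter_vgg16.h5"  # hypothetical saved-model path

def retrain():
    """Reload the saved model, rebuild the generator from fresh data, retrain, resave."""
    model = load_model(MODEL_PATH)
    datagen = ImageDataGenerator(rescale=1.0 / 255)
    train_gen = datagen.flow_from_directory(
        "data/train", target_size=(224, 224), batch_size=32,
        class_mode="categorical",
    )
    model.fit(train_gen, epochs=20)
    model.save(MODEL_PATH)

if __name__ == "__main__":
    retrain()
```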

4. Deploy the processing framework as an app service

  • Serve the model through a Streamlit app so end users can identify a bag's brand by uploading an image (supported formats: jpg/png/jpeg); a minimal sketch follows.
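A minimal sketch of the Streamlit app, assuming the model path and class ordering from the earlier sketches (both are assumptions):

```python
import numpy as np
import streamlit as st
from PIL import Image
from keras.models import load_model

BRANDS = ["Gucci", "Hermes", "LV", "Prada", "YSL"]  # assumed class order

st.title("Bag Hunter")
uploaded = st.file_uploader("Upload a bag image", type=["jpg", "png", "jpeg"])

if uploaded is not None:
    st.image(uploaded)
    # Match the training preprocessing: 224x224 RGB, rescaled to [0, 1].
    img = Image.open(uploaded).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype="float32")[np.newaxis] / 255.0
    model = load_model("models/bag_hunter_vgg16.h5")  # hypothetical path
    pred = model.predict(x)
    st.write(f"Predicted brand: {BRANDS[int(np.argmax(pred))]}")
    st.balloons()  # the cheerful balloons animation from the screenshot
```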

Workflow

Tools

  1. NumPy & Pandas: data manipulation
  2. MongoDB: data storage
  3. Keras: transfer learning and image preprocessing
  4. Streamlit: web app deployment

Results

1. Final model accuracy score: 0.7838

a. accuracy & loss chart

b. confusion matrix

c. classification report

2. Streamlit app screenshot

Upload a bag image and the app shows the correct prediction along with a cheerful balloons animation.