Extreme Classification

Extreme classification deals with multi-class and multi-label problems involving an extremely large number of choices. This repository contains a classifier trained to assign Amazon products to their Browse Node IDs. The dataset was obtained from the Amazon ML Challenge 2021.

Browse Node IDs are numeric codes that Amazon uses internally to identify product categories. There are more than 30 thousand product categories on Amazon, each one identified by a unique Node ID. In Amazon's own words:

"Browse Node ID's are positive integers that uniquely identify product sets, such as Literature & Fiction: (17), Medicine: (13996), Mystery & Thrillers: (18), Nonfiction: (53), Outdoors & Nature: (290060). Amazon uses thousands of browse node ID's."

Approach

The input dataframe is cleaned, and a custom Byte-Pair Encoding (BPE) tokenizer is trained on the corpus with the HuggingFace tokenizers library. The text is then tokenized, and the FastText library is used to learn text representations and perform classification. The entire process takes about 45 minutes. For an in-depth explanation, take a look at this notebook.
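The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the cleaning rule, the helper names (clean_text, to_fasttext_line, train_pipeline), the vocabulary size, and the training settings are all assumptions chosen for the example. The third-party imports sit inside train_pipeline so that the two small helpers can be used without tokenizers or fasttext installed.

```python
import re
import tempfile
from pathlib import Path


def clean_text(text: str) -> str:
    """Assumed cleaning step: lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def to_fasttext_line(tokens, node_id) -> str:
    """Format one example the way fastText's supervised mode expects:
    a '__label__<id>' prefix followed by space-separated tokens."""
    return f"__label__{node_id} " + " ".join(tokens)


def train_pipeline(titles, node_ids, vocab_size=2000):
    """Hypothetical end-to-end sketch: clean, train BPE, tokenize, train fastText."""
    # Imported here so the helpers above work without these packages.
    import fasttext
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    cleaned = [clean_text(t) for t in titles]

    # Train a BPE tokenizer on the cleaned corpus.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(cleaned, trainer=trainer)

    # Write the tokenized text in fastText's labelled-line format.
    lines = [
        to_fasttext_line(tokenizer.encode(text).tokens, node_id)
        for text, node_id in zip(cleaned, node_ids)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        train_file = Path(tmp) / "train.txt"
        train_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
        # Learns the text representation and the classifier together.
        model = fasttext.train_supervised(input=str(train_file), epoch=5)

    return tokenizer, model
```

At prediction time, a new product title would go through the same clean_text and tokenizer.encode steps before the joined tokens are passed to model.predict, so that the classifier sees input in the form it was trained on.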

Demo

Check out the demo on Streamlit


SupreethRao99/eXtreme-Classification
