Note : This readme presents a summary of the project with key insights and the final reusable models. If you want to access only the documentation for the model and evaluation classes, click here.
- Repo Summary
- Context & Data
- Machine Learning Modeling
- Results
- XGFood (Final model class)
- Evaluator (Evaluation Class)
- Limitations, possible areas for improvements & future work
- EDA_and_Preprocessing : Notebooks (`.ipynb`) with features & labels preprocessing
- ML Models : Notebooks (`.ipynb`) with model selection & evaluation
- food_model : Python scripts (`.py`) with models & evaluation classes
This project was carried out by 5 students over 5 months (February - June 2021), as part of a specialization in AI at emlyon business school (Lyon, France). Lucain (github), an Open Food Facts contributor, guided, helped and mentored us during this project (and we thank him very much!).
Labeling is a time-consuming and expensive task, but it is needed for many applications. More than 50% of Open Food Facts products are unlabelled, which makes some useful and interesting applications more difficult or impossible (statistics, research, score calculations such as the Nutri-Score, etc.).
The Open Food Facts database was, at the time of this project, composed of approximately 1.7 million products and 185 columns with varying information and fill rates.
Our goal was to predict a "high-level" category of products with this database. We have 2 labels to predict :
- The first group (often denoted G1 or pnns_groups_1), composed of 9 unique labels (e.g. beverages, salty snacks, etc.)
- The second group (often denoted G2 or pnns_groups_2), composed of 38 unique labels. These are sub-categories of group 1.
The full list of labels is accessible here for group 1 and here for group 2.
We decided to keep 2 features : `ingredients` and `product_name`. We then create a total of 938 new features, based on the most frequent ingredients and words.
Our feature engineering & preprocessing method gives deployable results without using information that is less frequently completed by Open Food Facts users, like nutritional values.
However, our method has some limitations and could be largely improved : more on that later.
Our unprocessed dataset is therefore composed of 167 548 products, 2 features and 2 targets :
- `product_name` (String) : feature - name of the product
- `ingredients` (Object) : feature - ingredients list in structured format (list of dicts)
- `pnns_groups_1` (Category) : target - G1 label (9 unique categories)
- `pnns_groups_2` (Category) : target - G2 label (38 unique categories)
After feature processing, we end up with 938 features (450 related to ingredients and 488 related to words). We keep 110K products for training, 28K for validation and 25K for test. Splits are stratified over G2 (38 labels).
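As an illustration, a stratified three-way split like this can be obtained with two calls to scikit-learn's `train_test_split` (a minimal sketch; the variable names and random seed are assumptions) :

```python
from sklearn.model_selection import train_test_split

# Split off the test set first, then split the remainder into train/validation.
# Stratifying on the G2 labels (38 classes) keeps label proportions similar
# across the three sets; sizes approximate the 110K / 28K / 25K split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y_G2, test_size=25_000, stratify=y_G2, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=28_000, stratify=y_rest, random_state=42
)
```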
We tried linear models and support vector machines with dimensionality reduction (PCA), as well as tree-based bagging and boosting methods (random forests, AdaBoost, XGBoost), with different framings : one-vs-rest, and chained or parallel models for multi-output, for example.
Here is a short summary of scores (predicting G1) for models :
(Pipeline) - Model | Best macro precision | Notes |
---|---|---|
Random Forest | 85.45% | Tested with 2K trees, or with small forests in one-vs-rest |
(PCA) - Logistic Regression | 69.31% | Tested with Lasso, Ridge and Elastic-net penalties |
(PCA) - Support Vector Machines | 69.77% | Tested with various C (range 0.5, 3) and RBF kernel |
Adaboost | 83.11% | Tested with Decision Tree for base estimator |
XGBoost | 85.79% | Used CART gradient boosted trees with the hist method |
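As an illustration of the one-vs-rest framing mentioned above, here is a minimal scikit-learn sketch (the base model and its parameters are illustrative, not our exact configuration) :

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# One binary "small forest" per G1 label; each forest votes for its own class.
ovr_rf = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
)
ovr_rf.fit(X_train, y_train_G1)
y_pred_G1 = ovr_rf.predict(X_valid)
```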
We finally opted for a chain classification with XGBoost to predict multiple labels :
We first train a model to predict G1 (9 labels), append the prediction to the feature matrix X, and finally fit a model to predict G2.
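A minimal sketch of this chain, assuming the processed feature matrix and encoded labels are already available (variable names are illustrative) :

```python
import numpy as np
from xgboost import XGBClassifier

# Step 1: fit a first model on the 938 features to predict G1 (9 labels).
model_G1 = XGBClassifier()
model_G1.fit(X_train, y_train_G1)

# Step 2: append the (encoded) G1 prediction as an extra column of X.
g1_train = model_G1.predict(X_train).reshape(-1, 1)
X_train_chained = np.hstack([X_train, g1_train])

# Step 3: fit a second model on the augmented matrix to predict G2 (38 labels).
model_G2 = XGBClassifier()
model_G2.fit(X_train_chained, y_train_G2)
```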
We used the following method to train, optimize and evaluate our models :
We defined thresholds to replace predictions with a low level of confidence by 'unknown', to increase precision and get a deployable model while keeping most of the original predictions (at least 80%).
Threshold definition is based on the probability distribution per label on the validation dataset, with `hue=pred_is_true` (`green` if the prediction is true, `red` otherwise) :
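A minimal sketch of how such per-label thresholds can then be applied at prediction time (the threshold values and variable names are illustrative, not the ones actually used) :

```python
import numpy as np

# Hypothetical per-label thresholds, read off the validation distributions so
# that most predictions are kept while low-confidence ones become 'y_unknown'.
thresholds_G1 = {0: 0.45, 1: 0.60, 2: 0.50}  # label index -> minimum confidence

proba = model_G1.predict_proba(X_valid)      # shape (n_samples, n_labels)
y_pred = proba.argmax(axis=1)
y_conf = proba.max(axis=1)

# Replace predictions whose confidence is below their label's threshold.
y_pred_filtered = np.array([
    pred if conf >= thresholds_G1.get(pred, 0.5) else 'y_unknown'
    for pred, conf in zip(y_pred, y_conf)
], dtype=object)
```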
We used cross-entropy as the evaluation metric. The loss path for the first label (G1) with no regularization is the following :
When we control the complexity, we obtain closer train & valid error paths :
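These loss paths can be reproduced from the fitted model, since XGBoost records the metric for each set passed in `eval_set` (a sketch using the training code shown below) :

```python
import matplotlib.pyplot as plt

# evals_result() holds the per-round metric for each dataset in eval_set:
# 'validation_0' is the training set, 'validation_1' the validation set.
results = model.evals_result()
plt.plot(results['validation_0']['mlogloss'], label='train')
plt.plot(results['validation_1']['mlogloss'], label='valid')
plt.xlabel('boosting round')
plt.ylabel('mlogloss (cross-entropy)')
plt.legend()
plt.show()
```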
The final tuned parameters used to build our model for G1 :
from xgboost import XGBClassifier

model = XGBClassifier(
#----Fixed parameters----
booster='gbtree', #gradient boosted tree
tree_method='hist', #hist method
eval_metric='mlogloss', #cross entropy
#----Control train path----
n_estimators=750, #n epochs
learning_rate=0.15, #not too small lr
#----Control data fitting----
colsample_bylevel=0.9, #features
colsample_bynode=0.9, #features
colsample_bytree=0.3, #features
subsample=0.8, #samples
#----Control complexity----
max_depth=40, #high max depth
gamma=3, #loss complexity
min_child_weight=1.5, #derivative of the loss
reg_alpha=1, #L1 norm
reg_lambda=1, #L2 norm
)
eval_set = [(X_train, y_train_G1), (X_valid, y_valid_G1)]
model.fit(X_train, y_train_G1, eval_set=eval_set, early_stopping_rounds=50)
We then use similar parameters to fit a model predicting the second label (G2).
Our final results, with the metrics that matter for deploying the model, are the following :
Where :
- G1 is the first label prediction (9 unique labels)
- G2 is the second label prediction (38 unique labels)
- Precision macro is the macro average precision for all categories of the dedicated group (G1 or G2)
- N_cats < 85% is the number of labels in a group for which the precision is more than 85%
- N_preds 'unknown' is, after threshold filtering, the proportion of predictions that are changed to 'unknown'.
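For reference, these metrics can be computed as follows (a sketch assuming decoded label arrays; the variable names are illustrative) :

```python
import numpy as np
from sklearn.metrics import precision_score

# Macro precision over the group's categories, and per-category precision
# (used to count how many categories meet the 85% target).
precision_macro = precision_score(y_true_G1, y_pred_G1, average='macro')
precision_per_label = precision_score(y_true_G1, y_pred_G1, average=None)

# Share of predictions replaced by 'y_unknown' after threshold filtering.
share_unknown = np.mean(np.asarray(y_pred_G1) == 'y_unknown')
```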
You can use the final model with the class XGFood (code here).
To try the model, the dataset with 167 548 labelled (`y_true`) and unprocessed products (`X_raw`) can be found on this Google drive.
Model files (`.model`) can be downloaded here.
Finally, you can download the small folder with JSON files.
**Put the `.model` files in the `files` folder, and the `files` folder in the same folder as `xgfood.py` before running the model. Otherwise, you will not be able to load every dependent file.**
After loading the dataset, you can load the models by instantiating the XGFood class :
from xgfood import XGFood
model = XGFood()
And then use the predict method :
model.predict(X_raw)
By default, the output is a pandas `DataFrame` with 4 columns :
y_pred_G1 | y_conf_G1 | y_pred_G2 | y_conf_G2 |
---|---|---|---|
milk and dairy products | 0.90 | milk and yogurt | 0.93 |
sugary snacks | 0.99 | sweets | 0.98 |
composite foods | 0.67 | one dish meals | 0.98 |
... | ... | ... | ... |
fruits and vegetables | 1.00 | fruits | 0.97 |
Where :
- `y_pred_G1` : predictions for group 1 (9 labels + 'y_unknown')
- `y_pred_G2` : predictions for group 2 (38 labels + 'y_unknown')
- if `get_confidence` is set to `True` (default=`True`) :
  - `y_conf_G1` : level of confidence for the G1 prediction
  - `y_conf_G2` : level of confidence for the G2 prediction
You can instead get a numpy `array` format by passing `pred_format='np'` to the predict method :
model.predict(X_raw, pred_format='np')
If you want, you can ask for only the predictions (without confidence levels) by setting `get_confidence=False` :
model.predict(X_raw, get_confidence = False)
You can also set `preprocess=False` if you preprocess X beforehand with the `.process_X` method :
X = model.process_X(X_raw)
predictions = model.predict(X, preprocess=False)
This is the structure of `xgfood.py`.
Given an `X_raw`, predictions are obtained with the following process :
- Creation of a null `X` matrix (938 features, n_samples), where 938 corresponds to the variables and n_samples to the number of products to be labeled
- Navigation in the provided `ingredients` column to fill the 450 variables relating to ingredients, using estimates of the percentage of each ingredient present in each product
- Navigation in the provided `product_name` column to fill the 488 variables relating to text, set to 1 if the word is present
- Send the data to model 1 for the Group 1 prediction
- Save predictions and probabilities
- Filter with thresholds defined per category (prediction replaced by 'y_unknown' if the confidence level is below the threshold)
- Update `X` with the Group 1 prediction
- Send the updated data to model 2 for the Group 2 prediction (38 unique labels)
- Save predictions and probabilities
- Filter with thresholds defined per category (prediction replaced by 'y_unknown' if the confidence level is below the threshold)
- Decode the predicted labels
- Add predictions + confidence levels in a DataFrame
- Output : numpy array or pandas DataFrame of shape (n_predictions, 4) : [pred_G1, conf_G1, pred_G2, conf_G2]
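A rough sketch of the first three steps of that process (the vocabularies and the fill rules shown here are simplified, hypothetical stand-ins for the real preprocessing) :

```python
import numpy as np

# Hypothetical vocabularies learned during preprocessing.
ingredient_index = {'en:sugar': 0, 'en:salt': 1}   # 450 entries in practice
word_index = {'chocolate': 450, 'juice': 451}      # 488 entries in practice
n_features = 938

def build_features(products):
    """Build the null matrix, then fill ingredient percentages and word flags."""
    X = np.zeros((len(products), n_features))
    for i, product in enumerate(products):
        # Ingredients: use the percent estimate when available.
        for ing in product.get('ingredients', []):
            j = ingredient_index.get(ing.get('id'))
            if j is not None:
                X[i, j] = ing.get('percent_estimate', 1)
        # Product name: 1 if a known word is present.
        for word in str(product.get('product_name', '')).lower().split():
            j = word_index.get(word)
            if j is not None:
                X[i, j] = 1
    return X
```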
The Evaluator class is dedicated to quickly evaluating model performance with basic but formatted seaborn plots. We mainly used seaborn and matplotlib.
The class is composed of a `build_data` method dedicated to building the dataset used, followed by different simple plot & metrics methods.
The simple plots shown earlier in this document are some examples.
You can access the code here.
After downloading `evaluator.py`, you can instantiate the class and build the dataset with your inputs :
evaluation = Evaluator()
evaluation.build_data(y_true=y_true, y_pred=y_pred, y_confidence=y_conf)
Note that `y_confidence` is optional and defaults to `None` (but it is required for the `.plot_confidence` method) :
evaluation = Evaluator()
evaluation.build_data(y_true=y_true, y_pred=y_pred)
List of available methods :
#Default arguments
Evaluator.global_metrics(average='weighted', name='Model')
Returns Accuracy, Recall, Precision and F1-score. `average` can take `macro` or `weighted`.
By default, the output starts with the header `Model Classification Metrics` :
Evaluator.global_metrics()
Model Classification Metrics :
------------------------------
Accuracy : 85.10%
Recall : 85.10%
Precision : 95.23%
F1-score : 89.77%
You can change this to the desired text if you want to keep track in your output of what you are evaluating :
Evaluator.global_metrics(name='XGBoost G1 Validation Set')
XGBoost G1 Validation Set Classification Metrics :
--------------------------------------------------
Accuracy : 85.10%
Recall : 85.10%
Precision : 95.23%
F1-score : 89.77%
#Default arguments
Evaluator.classification_report(
sortby='precision', name='model', save_report=False, report_path='classification_report.csv'
)
Returns a standard `sklearn.metrics.classification_report()` but in `pandas.DataFrame` format.
The report can be sorted by precision, recall, f1-score or support (default=`'precision'`) :
Evaluator.classification_report(sortby='support')
The report can also be saved in CSV format by setting `save_report=True`.
The default path is the current folder with the file name `'classification_report.csv'`, but you can change it with `report_path=*your_path*` :
Evaluator.classification_report(save_report=True, report_path='files/classification_report.csv')
Here too you can change the name; it will affect the column names :
Evaluator.classification_report(name='XGBoost G1')
XGBoost G1_precision | XGBoost G1_recall | XGBoost G1_f1-score | XGBoost G1_support | |
---|---|---|---|---|
fat and sauces | 0.98 | 0.90 | 0.94 | 63.00 |
milk and dairy products | 0.98 | 0.89 | 0.93 | 131.00 |
fish meat eggs | 0.97 | 0.88 | 0.92 | 135.00 |
beverages | 0.97 | 0.77 | 0.86 | 108.00 |
salty snacks | 0.96 | 0.83 | 0.89 | 58.00 |
sugary snacks | 0.96 | 0.94 | 0.95 | 214.00 |
weighted avg | 0.95 | 0.85 | 0.90 | 1000.00 |
composite foods | 0.93 | 0.83 | 0.88 | 101.00 |
cereals and potatoes | 0.92 | 0.74 | 0.82 | 111.00 |
fruits and vegetables | 0.90 | 0.76 | 0.82 | 79.00 |
macro avg | 0.86 | 0.75 | 0.80 | 1000.00 |
accuracy | 0.85 | 0.85 | 0.85 | 0.85 |
y_unknown | 0.00 | 0.00 | 0.00 | 0.00 |
#Default arguments
Evaluator.plot_categories_scores(
metric='precision', name='Model', figsize=(8, 10), save_fig=False, fig_path='score_by_category.png'
)
Returns a point plot with the desired metric per category, sorted by that metric. The default metric is precision.
Here too you can change the name (it will affect the title) and save the figure. You can also change the figsize :
Evaluator.plot_categories_scores(name=r'XGBoost$G_2$, Valid Set', figsize=(8, 10))
#Default arguments
Evaluator.plot_confusion_matrix(
name='Model', figsize=(20, 15), annot=True, cmap='Greens', save_fig=False, fig_path='confidence_by_category.png'
)
Returns a confusion matrix with annotations, in seaborn heatmap style.
Evaluator.plot_confusion_matrix()
You can also remove annotations by passing `annot=False`.
#Default arguments
Evaluator.plot_confidence(
name='Model', metric='precision', col_wrap=5, save_fig=False, fig_path='confidence_by_category.png'
)
Returns a KDE plot for every category, showing the confidence (probability) distribution (`hue=pred_is_true`, green if the prediction was true, red otherwise). The desired metric is also shown in the title of each category plot (default is precision).
You can save the figure, and you can change the number of plots per row with `col_wrap` (default is `col_wrap=5`) :
Evaluator.plot_confidence(metric='precision', col_wrap=5)
This part discusses the limits of the adopted method, some areas of improvement, and future work possibilities we found potentially interesting.
The results are quite interesting and XGBoost performs well on both the validation and test sets. However, the method we used to preprocess features has some limitations.
- We select only the N most frequent words and ingredients. Some "outlier" products may have no discriminative ingredients or words. Such products would end up with a null vector (0 everywhere), or only a few non-null features that do not help the trees classify with an acceptable confidence level.
- Moreover, our method ends up with 900+ features. This makes the model difficult to interpret and the features difficult to explore. Another processing method may give a less complex dataset.
- When we fill the features relative to ingredients, we first try the percent estimate. If it is not available, we then try the percent min and percent max. Finally, we fill with "1" or the median when the ingredient is present in the product but no information about the percentage is available. This method could be improved.
- Finally, the cleaning of the word and ingredient lists could be largely improved to avoid duplicates ('juice' and 'juices' for example) and uninteresting words; see the sketch after this list.
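For example, a lemmatizer could merge such variants before counting word frequencies (a sketch only, not part of the current pipeline, and limited here to English) :

```python
import spacy

# Small English pipeline; the dataset is multilingual, so this is only a sketch.
nlp = spacy.load("en_core_web_sm")

def lemmatize(text):
    """Reduce words to their lemma so 'juice' and 'juices' count as one token."""
    return [token.lemma_.lower() for token in nlp(text) if token.is_alpha]

print(lemmatize("Orange juices with apple juice"))
# ['orange', 'juice', 'with', 'apple', 'juice']
```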
We did not explore some potentially interesting features in the original dataset. A more in-depth work on text & object features could lead to better results.
We tuned the model "manually" for a long time and finally ran a cross-validation grid search to get interesting parameters. However, the XGBoost library gives a lot of possibilities to tune the model. It is quite possible that we did not tune the model well enough, and that a more carefully selected parameter list for grid search, together with better control of the bias/variance tradeoff, could lead to better results. It is well known, for example, that gamma (loss complexity control) has a huge impact and that its tuning depends largely on both the other parameters and the dataset.
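A sketch of such a grid search with scikit-learn (the grid itself is illustrative and deliberately small; a real search space should be chosen more carefully) :

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# gamma interacts strongly with max_depth, min_child_weight and the sampling
# parameters, so they should ideally be explored together.
param_grid = {
    'gamma': [0, 1, 3, 5],
    'max_depth': [10, 20, 40],
    'min_child_weight': [1, 1.5, 3],
}
search = GridSearchCV(
    XGBClassifier(booster='gbtree', tree_method='hist', eval_metric='mlogloss'),
    param_grid,
    scoring='precision_macro',
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train_G1)
print(search.best_params_)
```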
Finally, we showed that text features (ingredients, product name, etc.) are strongly effective predictors. To get the most out of those features, natural language processing techniques, especially from the deep learning area (recurrent neural networks or attention models for example), would be a good solution.
We tried some approaches without going into depth. Our models do not give very good results (50-65% macro accuracy) but could be largely improved.
Using the raw ingredients text and product_name combined in a single 'text' feature, we tried to build a bidirectional LSTM in PyTorch. The code can be found here and the architecture is as follows :
We use spaCy to tokenize the data and word embeddings from GloVe 6B 300-dim pretrained vectors.
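A minimal PyTorch sketch of such a bidirectional LSTM classifier (the hyperparameters and the pre-built GloVe embedding matrix are assumptions, not our exact architecture) :

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=128, n_classes=9, dropout=0.3):
        super().__init__()
        # embedding_matrix: (vocab_size, 300) tensor built from GloVe 6B 300d.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.lstm = nn.LSTM(
            input_size=embedding_matrix.size(1),
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
            dropout=dropout,
        )
        self.fc = nn.Linear(2 * hidden_size, n_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)         # (batch, seq_len, 300)
        _, (h_n, _) = self.lstm(embedded)            # h_n: (num_layers * 2, batch, hidden)
        last = torch.cat([h_n[-2], h_n[-1]], dim=1)  # concat both directions of last layer
        return self.fc(last)                         # logits over the 9 G1 labels
```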
Despite using dropout in the LSTM layer, this model largely overfits : we get less than 0.05 loss on the training set and 1.65 on validation. We did not explore how to improve the architecture and prevent overfitting, but future work in this area could lead to very interesting results.
We also trained the AWD LSTM of fastai (pretrained on Wikipedia). We got about 65% accuracy (code here). We think better text preprocessing and tokenization may lead to noticeably better results. AWD LSTM is based on this original paper. A good way to explore the fastai version is this notebook. To reproduce the model in plain PyTorch, you can also look at this GitHub repo and this one.
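For reference, fine-tuning the pretrained AWD LSTM with fastai follows this general pattern (a sketch assuming a `df` DataFrame with 'text' and 'label' columns; see the linked notebooks for the actual code) :

```python
from fastai.text.all import *

# Dataloaders from a DataFrame with the combined 'text' column and the G1 label.
dls = TextDataLoaders.from_df(df, text_col='text', label_col='label', valid_pct=0.2)

# Classifier head on top of the AWD LSTM pretrained on Wikipedia (ULMFiT-style).
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4)
```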
Another project would be to train an encoder on selected data relevant to classifying food based on ingredients : supermarket & other food grocery websites, cooking blogs, etc. We did not think too much about this idea, but it may be a good project to explore (this method could also be applied to projects other than classification).
Finally, because of the multilingual nature of text in the dataset, we thought about some architectures to explore :
We can train a multilingual model, or one model per language, like this :
Or finally, translate text with an API before fitting a unique english model :
Neural networks in NLP are definitely promising.