
Kaggle Higgs ML Challenge

  • This repository contains my solution to the Higgs Machine Learning Challenge on Kaggle, built with the XGBoost algorithm and compared against a scikit-learn naive Bayes classifier.

Overview

  • The Higgs boson machine learning challenge (HiggsML or Challenge in short) has been set up to promote collaboration between high-energy physicists and data scientists. The ATLAS experiment at CERN provided simulated data used by physicists to optimize the analysis of the Higgs boson.
  • The presence of a large dataset with many features and many missing values suggested the use of the XGBoost algorithm. XGBoost is an industry-proven, open-source library that provides a gradient-boosting framework able to scale to billions of data points quickly and efficiently. I also implemented a naive Bayes classifier to get an insightful comparison between the two approaches.
  • The naive Bayes classification model gave an accuracy score of 67.54%, with AUC = 0.735.
  • In contrast, the XGBoost algorithm with optimized hyperparameters gave an accuracy score of approximately 84.10% with AUC = 0.911.

Data

  • Data:
    • Type:

      • Input: training.csv - Training set of 250000 events, with an ID column, 30 feature columns, a weight column, and a label column
      • Input: test.csv - Test set of 550000 events with an ID column and 30 feature columns.
      • Input: random_submission - Sample submission file in the correct format. File format is described on the Evaluation page.
      • Input: HiggsBosonCompetition_AMSMetric - Python script to calculate the competition evaluation metric.
    • Size: 56.9 MB

Preprocessing / Clean up

  • Changed the values of the "Label" column in the training dataset from (s, b) [which denote signal and background] to (1, 0), which makes the labels easier to manipulate numerically. The training file was then split into a signal data frame and a background data frame according to the Label value, as sketched below.
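A minimal sketch of this step, assuming the column is named Label as in the Kaggle CSV header:

```python
import pandas as pd

# Load the Kaggle training file (file name as listed in the Data section).
train = pd.read_csv("training.csv")

# Relabel: signal ("s") -> 1, background ("b") -> 0.
train["Label"] = train["Label"].map({"s": 1, "b": 0})

# Split into separate signal and background data frames.
signal = train[train["Label"] == 1]
background = train[train["Label"] == 0]
```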

Data Visualization

[Figure: pie chart of the training-set class balance]

The pie chart shows the split between signal events (roughly one third of the training dataset) and background events.
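A sketch of how such a class-balance chart can be produced with matplotlib; the original plot may have been generated differently:

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("training.csv")

# Count signal ("s") and background ("b") events and draw the class-balance pie chart.
counts = train["Label"].value_counts()
plt.pie(counts, labels=counts.index, autopct="%1.1f%%")
plt.title("Signal vs. background events in the training set")
plt.show()
```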

[Figure: pairplot comparing signal and background distributions]

The pairplot shows how the signal and background distributions relate across features, and helps us understand where to place a selection, or which separation algorithm to use.

Problem Formulation

The signal sample contains events in which Higgs bosons (with a fixed mass of 125 GeV) were produced. The background sample was generated by other known processes that can produce events with at least one electron or muon and a hadronic tau, mimicking the signal. For the sake of simplicity, only three background processes were retained for the Challenge. The first comes from the decay of the Z boson (with mass 91.2 GeV) into two taus; this decay produces events with a topology very similar to that of a Higgs decay. The second set contains events with a pair of top quarks, which can have a lepton and a hadronic tau among their decay products. The third set involves the decay of the W boson, where one electron or muon and a hadronic tau can appear simultaneously only through imperfections of the particle identification procedure.

  • Define:
    • Models
      • Naive Bayes Classifier - a probabilistic machine learning model based on Bayes' theorem, widely known for its simplicity and computational efficiency. It is particularly fast at training and prediction, which is advantageous when dealing with large datasets.
      • XGBoost - short for Extreme Gradient Boosting. It is known for its high performance and accuracy, and is especially effective on structured data with complex relationships and interactions among features.
    • The XGBClassifier implementation is used in this solution.
    • The hyperparameters tuned for the XGBoost algorithm are the depth of each decision tree (max_depth), the step size at each iteration (learning_rate), the fraction of samples used to grow each tree (subsample), the regularization terms on leaf weights (reg_alpha and reg_lambda, which add sparsity to the model), and the minimum loss reduction required to split a node (gamma), which controls tree complexity by pruning splits that do not reduce the loss enough. A configuration sketch follows this list.
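As a sketch, these hyperparameter families map onto an XGBClassifier roughly as follows; the numeric values are illustrative, not the tuned values from the notebook:

```python
from xgboost import XGBClassifier

# Illustrative configuration; the actual values are found by Bayesian optimization.
model = XGBClassifier(
    n_estimators=500,     # number of boosted trees
    max_depth=6,          # depth of each decision tree
    learning_rate=0.1,    # step size at each boosting iteration
    subsample=0.8,        # fraction of samples used to grow each tree
    reg_alpha=0.1,        # L1 regularization on leaf weights (adds sparsity)
    reg_lambda=1.0,       # L2 regularization on leaf weights
    gamma=0.1,            # minimum loss reduction required to split a node (pruning)
)
```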

Training

  • Training set synopsis: Number of observations: 250000, Number of columns: 33, Number of integer columns: 2, Number of float columns: 30, Number of object columns: 1, Memory Usage: 62.94 MB
  • The training dataset was first divided into X_train and y_train, where X_train contains all the feature values, while y_train indicates whether the corresponding event is a signal or a background.
  • The feature set contains all columns except EventId, Weight, and Label, while the target set is just the Label column.
  • For the Naive Bayes classifier, a simple default classification fit was done between X_train and y_train. A y_pred was then produced from the features of the test dataset (X_test).
  • For XGBoost, I first defined the estimators and other hyperparameters, and then used a Bayesian optimizer for the tuning process. A condensed sketch of this flow is shown below.
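A condensed sketch of the training flow described above. The README does not name the Bayesian optimization library, so scikit-optimize's BayesSearchCV is used below purely as an illustration; the split ratio and search space are assumptions as well:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from skopt import BayesSearchCV
from xgboost import XGBClassifier

train = pd.read_csv("training.csv")
train["Label"] = train["Label"].map({"s": 1, "b": 0})

# Features: every column except EventId, Weight and Label; target: Label.
X = train.drop(columns=["EventId", "Weight", "Label"])
y = train["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Naive Bayes: default fit, then predictions on the held-out features.
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

# XGBoost: Bayesian search over the hyperparameters listed in Problem Formulation.
search = BayesSearchCV(
    XGBClassifier(n_estimators=500),
    {
        "max_depth": (3, 10),
        "learning_rate": (0.01, 0.3, "log-uniform"),
        "subsample": (0.5, 1.0),
        "reg_alpha": (1e-3, 10.0, "log-uniform"),
        "reg_lambda": (1e-3, 10.0, "log-uniform"),
        "gamma": (1e-3, 5.0, "log-uniform"),
    },
    n_iter=25,
    cv=3,
)
search.fit(X_train, y_train)
y_pred_xgb = search.best_estimator_.predict(X_test)
```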

Performance Comparison

Model          Accuracy (%)
Naive Bayes    67.54
XGBoost        84.10
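A short sketch of how these accuracy figures can be computed with scikit-learn, reusing the held-out predictions from the training sketch above:

```python
from sklearn.metrics import accuracy_score

# y_test, y_pred_nb and y_pred_xgb come from the training sketch above.
print(f"Naive Bayes accuracy: {accuracy_score(y_test, y_pred_nb):.4f}")
print(f"XGBoost accuracy:     {accuracy_score(y_test, y_pred_xgb):.4f}")
```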
[Figure: ROC curve for the Naive Bayes model]

The ROC curve for the Naive Bayes model gives an AUC of 0.735, which is better than random guessing but not good enough for a classification problem of this importance.

[Figure: ROC curve for the XGBoost model]

The ROC curve for the XGBoost model gives an AUC of 0.911, which indicates a strong separation between signal and background and therefore a good classification.
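Both ROC curves can be reproduced along these lines, reusing the fitted models from the training sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# nb, search, X_test and y_test come from the training sketch above.
for name, model in [("Naive Bayes", nb), ("XGBoost", search.best_estimator_)]:
    proba = model.predict_proba(X_test)[:, 1]   # predicted signal probability
    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, proba):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```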

[Figure: feature importance plot for the XGBoost model]

The feature importance plot for the XGBoost model shows that the derived variable DER_mass_MMC is clearly the most important feature for this classification; it is an estimate of the mass of the Higgs boson candidate.
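A sketch of how such a plot can be drawn with XGBoost's built-in helper, applied to the tuned estimator from the training sketch:

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Plot the most important features of the tuned booster.
plot_importance(search.best_estimator_, max_num_features=15)
plt.tight_layout()
plt.show()
```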

Conclusions

  • From the analysis above, we can clearly see that XGBoost performs much better than the naive Bayes classifier.

Software Setup

  • Packages - pandas, numpy, matplotlib, scikit-learn, category_encoders, XGBoost
  • Disk Space- 4.5GB
  • Notebook Runtime- Approx. 16mins

Citations

  • Approximate Median Significance - the Higgs Machine Learning ..., higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf.
