Kaggle Higgs ML Challenge

  • This repository contains my solution to the Higgs Machine Learning Challenge on Kaggle, built with the XGBoost algorithm. I also compare it against a scikit-learn naive Bayes classifier.

Overview

  • The Higgs boson machine learning challenge (HiggsML, or the Challenge for short) was set up to promote collaboration between high-energy physicists and data scientists. The ATLAS experiment at CERN provided simulated data used by physicists to optimize the analysis of the Higgs boson.
  • The dataset is large, has many features, and contains many missing values, which suggested the use of XGBoost. XGBoost is a widely used open-source library that provides a gradient-boosting framework designed to scale to very large datasets, and it handles missing values natively. I also implemented a naive Bayes classifier to get an informative comparison between the two approaches.
  • The naive Bayes classification model gave an accuracy of 67.54%, with AUC = 0.735.
  • In contrast, the XGBoost model with optimized hyperparameters gave an accuracy of approximately 84.10%, with AUC = 0.911.

Data

  • Data:
    • Type:

      • Input: training.csv - Training set of 250000 events, with an ID column, 30 feature columns, a weight column, and a label column
      • Input: test.csv - Test set of 550000 events with an ID column and 30 feature columns.
      • Input: random_submission - Sample submission file in the correct format. File format is described on the Evaluation page.
      • Input: HiggsBosonCompetition_AMSMetric - Python script to calculate the competition evaluation metric.
    • Size: 56.9 MB
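
The competition metric referenced above is the Approximate Median Significance (AMS). As documented for the Challenge, AMS = sqrt(2((s + b + b_r) ln(1 + s / (b + b_r)) - s)) with regularization term b_r = 10, where s and b are the weighted sums of selected true signal and selected background events. The following is a minimal sketch of that formula, not a copy of the HiggsBosonCompetition_AMSMetric script; the function names `ams` and `weighted_counts` are illustrative.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate Median Significance for weighted signal s and background b."""
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


def weighted_counts(labels, predictions, weights):
    """Sum the weights of events predicted as signal, split by true label (1 = s, 0 = b)."""
    s = sum(w for y, p, w in zip(labels, predictions, weights) if p == 1 and y == 1)
    b = sum(w for y, p, w in zip(labels, predictions, weights) if p == 1 and y == 0)
    return s, b


# Toy example with made-up labels, predictions, and weights.
s, b = weighted_counts([1, 0, 1, 0], [1, 1, 0, 0], [0.5, 2.0, 1.0, 3.0])
score = ams(s, b)
```

Only events that the classifier selects as signal contribute to s and b, so the score depends on the chosen decision threshold as well as on the model.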

Preprocessing / Clean up

  • Changed the values of the "Label" column in the training dataset from (s, b), which denote signal and background, to (1, 0). Numeric labels are easier to manipulate and are what the classifiers expect. The training data was then split into a signal data frame and a background data frame according to the label value.
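
The relabeling and split described above can be sketched with pandas as follows. This is an illustration under assumptions, not the notebook's code: the tiny data frame stands in for training.csv, its values are made up, and the column is assumed to be named `Label`.

```python
import pandas as pd

# Toy stand-in for training.csv (values are made up for illustration).
train = pd.DataFrame({
    "EventId": [1, 2, 3, 4],
    "DER_mass_MMC": [120.0, 95.0, 88.0, 130.0],
    "Weight": [0.1, 2.0, 1.5, 0.2],
    "Label": ["s", "b", "b", "s"],
})

# Map the string labels to integers: signal -> 1, background -> 0.
train["Label"] = train["Label"].map({"s": 1, "b": 0})

# Separate the events into signal and background data frames.
signal = train[train["Label"] == 1]
background = train[train["Label"] == 0]
```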

Data Visualization

image

The pie chart above shows the split between signal events (roughly one third of the training dataset) and background events.

image

The pairplot above shows how the signal and background distributions relate to each other, which helps in deciding where to place selection cuts or which separation algorithm to use.

Problem Formulation

The signal sample contains events in which Higgs bosons (with fixed mass 125 GeV) were produced. The background sample was generated by other known processes that can produce events with at least one electron or muon and a hadronic tau, mimicking the signal. For the sake of simplicity, only three background processes were retained for the Challenge. The first comes from the decay of the Z boson (with mass 91.2 GeV) into two taus. This decay produces events with a topology very similar to that produced by the decay of a Higgs. The second set contains events with a pair of top quarks, which can have a lepton and a hadronic tau among their decay products. The third set involves the decay of the W boson, where one electron or muon and a hadronic tau can appear simultaneously only through imperfections of the particle identification procedure.

  • Define:
    • Models
      • Naive Bayes Classifier: a probabilistic machine learning model based on Bayes' theorem, widely known for its simplicity and computational efficiency. It is particularly fast at training and prediction, which is an advantage when dealing with large datasets.
      • XGBoost: also known as Extreme Gradient Boosting. It is known for its high performance and accuracy, and is especially effective on structured data with complex relationships and interactions among features.
    • The XGBClassifier implementation from the XGBoost Python package is used here.
    • The hyperparameters tuned for the XGBoost model are: the depth of each decision tree, the step size at each boosting iteration (learning rate), the fraction of samples used to train each tree, the L1 and L2 regularization terms on the weights (reg_alpha and reg_lambda; the L1 term adds sparsity to the model), and the minimum loss reduction required to split a node, which controls tree complexity by pruning splits that do not reduce the loss enough.
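
As a rough illustration, the hyperparameters listed above map onto the following keyword arguments of XGBoost's scikit-learn wrapper. The values below are placeholders chosen for the example, not the tuned values from the notebook (those are not stated in this README), and `build_classifier` is a hypothetical helper.

```python
# Placeholder values for illustration only; not the tuned values from the notebook.
xgb_params = {
    "n_estimators": 200,    # number of boosted trees
    "max_depth": 6,         # depth of each decision tree
    "learning_rate": 0.1,   # step size at each boosting iteration
    "subsample": 0.8,       # fraction of samples used to grow each tree
    "reg_alpha": 0.0,       # L1 regularization on weights (encourages sparsity)
    "reg_lambda": 1.0,      # L2 regularization on weights
    "gamma": 0.0,           # minimum loss reduction required to split a node
}


def build_classifier(params=None):
    """Construct an XGBClassifier from a parameter dictionary (hypothetical helper)."""
    # Imported lazily so the dictionary can be inspected without xgboost installed.
    from xgboost import XGBClassifier

    return XGBClassifier(**(params if params is not None else xgb_params))
```

Calling `build_classifier()` returns an unfitted model that exposes the usual `fit` and `predict` methods.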

Training

  • Training set synopsis: Number of observations: 250000, Number of columns: 33, Number of integer columns: 2, Number of float columns: 30, Number of object columns: 1, Memory Usage: 62.94 MB
  • The training dataset was first divided into X_train and y_train, where X_train holds the feature columns and y_train indicates whether the corresponding feature values belong to a signal or a background event.
  • The feature set contains all columns except EventID, Weight and Label, while the target set is just the Label column.
  • For the Naive Bayes Classifier, a simple fit with default settings was done on X_train and y_train. Predictions (y_pred) were then made from the features of the test dataset (X_test).
  • For XGBoost, I first defined the estimators and other hyperparameters, and then used a Bayesian optimizer for the optimization process.
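
The feature/target split and the naive Bayes baseline described above might look like the sketch below. Several details are assumptions made for the example: the data is synthetic, the feature names `feat_a` and `feat_b` are hypothetical, `GaussianNB` is assumed as the naive Bayes variant, and the test set is taken to be a held-out part of the labeled data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the training data (not the real Higgs features).
rng = np.random.default_rng(0)
n = 400
label = rng.integers(0, 2, size=n)
train = pd.DataFrame({
    "EventId": np.arange(n),
    "feat_a": rng.normal(loc=2.0 * label, scale=1.0),  # informative feature
    "feat_b": rng.normal(size=n),                      # pure noise
    "Weight": np.ones(n),
    "Label": label,
})

# Features are every column except the ID, the weight, and the label.
X = train.drop(columns=["EventId", "Weight", "Label"])
y = train["Label"]

# Hold out part of the labeled data for evaluation (an assumption of this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the naive Bayes classifier with default settings and predict.
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
```

Dropping the Weight column from the features matters because the event weights are only available for the training data and would otherwise leak label information.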

Performance Comparison

Model          Accuracy (%)    AUC
Naive Bayes    67.54           0.735
XGBoost        84.10           0.911

image

The image above shows the ROC curve for the Naive Bayes model. The AUC of 0.735 is clearly better than random guessing (AUC = 0.5), but it is not good enough for a classification problem of this difficulty.

image

The image above shows the ROC curve for the XGBoost model. The AUC of 0.911 is much closer to the ideal value of 1 and indicates a strong separation between signal and background.
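
To make the AUC values easier to interpret: the AUC equals the probability that a randomly chosen signal event receives a higher score than a randomly chosen background event, with ties counted as one half. The pure-Python sketch below computes it directly from that definition. The function name `roc_auc` and the toy scores are illustrative; for datasets of this size, `sklearn.metrics.roc_auc_score` is the practical choice because this pairwise loop is quadratic.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (signal, background) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one signal and one background event")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # signal ranked above background
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four signal/background pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```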

image

A feature importance plot for the XGBoost model is shown above. The derived variable DER_mass_MMC is clearly the most important feature for this classification task. This variable gives an estimated mass of the Higgs boson candidate.

Conclusions

  • The analysis above shows that XGBoost performs much better than the Naive Bayes Classifier on this dataset, in both accuracy (84.10% vs. 67.54%) and AUC (0.911 vs. 0.735).

Software Setup

  • Packages: pandas, numpy, matplotlib, scikit-learn, category_encoders, XGBoost
  • Disk Space: 4.5 GB
  • Notebook Runtime: Approx. 16 minutes

Citations

  • Approximate Median Significance - the Higgs Machine Learning ..., higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf.