Protein Family Classification from Raw Protein Sequences using Naïve Bayes
• A labelled Structural Protein Sequence dataset containing 346,325 string-type data points was imported from Kaggle and pre-processed
• Feature extraction from the raw string data was performed using CountVectorizer
• A Naïve Bayes classifier was used for prediction from the count-vectorized features, followed by an AdaBoost classifier for comparison
• The classification accuracy achieved was 76.38%
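
The steps above could be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the project's actual code: the toy sequences, the family labels, the character-level 3-gram setting, and the choice of MultinomialNB as the Naïve Bayes variant are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy sequences and family labels, invented for illustration only
# (they are not taken from the Kaggle dataset).
sequences = [
    "MKVLAAGIVGLLLAQ", "MKVLAAGLVGLLLAA", "MKVIAAGIVGLLMAQ",
    "GSHMDDEEKRRWWPPY", "GSHMDDEDKRRWWPPF", "GSHMDEEEKRRWYPPY",
]
labels = ["HYDROLASE"] * 3 + ["TRANSFERASE"] * 3

# Assumed settings: count overlapping character 3-grams and keep
# the amino-acid letters in upper case.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False),
    MultinomialNB(),
)
model.fit(sequences, labels)

# Predict the family of two unseen toy sequences.
predicted = model.predict(["MKVLAAGIVGLLLAA", "GSHMDDEEKRRWWPPF"])
print(list(predicted))
```

For the comparison step, scikit-learn's AdaBoostClassifier could be placed in the same pipeline instead of MultinomialNB, with accuracy measured on a held-out split.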