The primary objective of this project is to develop a model that can predict movie ratings based on provided features.
This project utilized the following datasets:
- Movie_data: Contains preprocessed movie information, including MovieName, Genre, and MovieIDs.
- Ratings_data: Contains preprocessed ratings information, including Ratings and Timestamp.
- Users_data: Contains preprocessed user information, including Gender, Age, Occupation, and Zip-code.
The project made use of several essential libraries:
- numpy
- pandas
- matplotlib.pyplot
- seaborn
- sklearn.preprocessing.LabelEncoder
- sklearn.preprocessing.MinMaxScaler
- sklearn.model_selection.train_test_split
- sklearn.linear_model.LogisticRegression
- Loaded Movie_data, Ratings_data, and Users_data as DataFrames from separate CSV files.
- Eliminated missing values from each DataFrame using `dropna(inplace=True)`.
- Displayed the shape and descriptive statistics for each DataFrame using `df.shape` and `df.describe()`.
- Converted the 'Gender' column in Users_data from categorical to numerical values using LabelEncoder.
- Horizontally concatenated the DataFrames using `pd.concat` to create a final dataset `df_data`.
- Dropped unnecessary columns like 'Occupation', 'Zip-code', and 'Timestamp' from the final dataset to create `df2`.
- Removed any remaining missing values from the final dataset `df2` using `dropna()`.
- Conducted data visualization using count plots and histograms to analyze the distribution of ratings, genders, and age.
- Created the feature matrix `input` and target vector `target` using relevant columns from the final dataset `df_final`.
- Split the data into training and testing sets using `train_test_split`.
- Scaled the input data using MinMaxScaler to normalize the values between 0 and 1.
- Initialized a logistic regression model and trained it on the training data using `LogisticRegression`.
- Employed the trained model to predict movie ratings for the test set using `model.predict(X_test)`.
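
The steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual code: the three DataFrames are filled with made-up synthetic values in place of the CSV files, the feature columns (`MovieIDs`, `Age`, `Gender`) are an assumption because the write-up only mentions "relevant columns", and `test_size`, `random_state`, and `max_iter` are illustrative choices. The variable names `features`, `X_train_scaled`, and `X_test_scaled` are also hypothetical. The scaler is fitted on the training split only, which is one common way to avoid leaking test-set statistics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Synthetic stand-ins for the three CSV files (all values are made up).
rng = np.random.default_rng(0)
n = 200
movie_data = pd.DataFrame({
    "MovieIDs": rng.integers(1, 50, size=n),
    "MovieName": [f"Movie {i}" for i in range(n)],
    "Genre": rng.choice(["Drama", "Comedy", "Action"], size=n),
})
ratings_data = pd.DataFrame({
    "Ratings": rng.integers(1, 6, size=n),  # integer ratings 1..5
    "Timestamp": rng.integers(970_000_000, 980_000_000, size=n),
})
users_data = pd.DataFrame({
    "Gender": rng.choice(["F", "M"], size=n),
    "Age": rng.choice([1, 18, 25, 35, 45, 50, 56], size=n),
    "Occupation": rng.integers(0, 21, size=n),
    "Zip-code": rng.integers(10000, 99999, size=n).astype(str),
})

# Drop missing values, then encode Gender as integers.
for frame in (movie_data, ratings_data, users_data):
    frame.dropna(inplace=True)
users_data["Gender"] = LabelEncoder().fit_transform(users_data["Gender"])

# Horizontal concatenation aligns rows by index, not by a key column.
df_data = pd.concat([movie_data, ratings_data, users_data], axis=1)
df2 = df_data.drop(columns=["Occupation", "Zip-code", "Timestamp"]).dropna()

# Assumed feature columns; the write-up does not list them.
features = df2[["MovieIDs", "Age", "Gender"]]
target = df2["Ratings"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic regression treats each rating value as a separate class.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
```

Because the synthetic ratings are random, the predictions here carry no signal; the sketch only shows how the pieces connect. Note that `pd.concat(..., axis=1)` pairs rows purely by position, so it assumes the three tables are already row-aligned.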