This project is the final assignment for the "Introduction to Data Science" course. The objective is to demonstrate the application of data science principles to a real-world dataset. Our chosen domain is comics, and the data has been crawled from MyAnimeList, a popular platform for anime and manga enthusiasts.
- University: University of Science - VNUHCM
- Course: CSC14119 - Introduction to Data Science
- Duration: 16 November 2023 - 14 December 2023
Instructors:
- Nguyen Thi Thu Hang
- Nguyen Bao Long
- Le Duc Thanh
- Nguyen Ngoc Thao
Contributors:
| Student ID | Full Name |
|---|---|
| 21127104 | Doan Ngoc Mai |
| 21127129 | Le Nguyen Kieu Oanh |
| 21127229 | Duong Truong Binh |
| 21127616 | Le Phuoc Quang Huy |
The project is organized into four Jupyter Notebook files, each representing a distinct step in the data science workflow:
- Data_Collecting.ipynb
  - This notebook contains the code for web scraping the data from MyAnimeList.
  - It outlines the process of extracting and storing the data in a structured format for further analysis.
- Data_Exploration_and_PreProcessing.ipynb
  - This notebook focuses on exploring the collected data to understand its structure and content.
  - It includes data cleaning, handling missing values, and preprocessing steps to prepare the data for analysis.
- Asking_Questions_and_Analyzing.ipynb
  - In this notebook, we define key questions and hypotheses about the data.
  - It includes visualizations and statistical analyses to extract insights and answer the formulated questions.
- Data_Modelling.ipynb
  - This notebook involves building and evaluating machine learning models based on the prepared dataset.
  - The models aim to predict or classify based on specific features of the comic data.
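As a rough illustration of the collecting step described above (extracting fields from HTML and storing them in a structured format), the sketch below parses a small inline HTML string with `beautifulsoup4` and serializes the result as CSV. The markup, the class names `entry`, `title`, and `score`, and the helper names `parse_entries` and `to_csv` are all hypothetical placeholders. They do not reflect MyAnimeList's real page structure or the code in Data_Collecting.ipynb.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup: these tag and class names are placeholders,
# not MyAnimeList's real page structure.
SAMPLE_HTML = """
<div class="entry">
  <span class="title">Example Manga A</span>
  <span class="score">8.5</span>
</div>
<div class="entry">
  <span class="title">Example Manga B</span>
  <span class="score">N/A</span>
</div>
"""


def parse_entries(html):
    """Extract one dict per entry from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for entry in soup.select("div.entry"):
        title = entry.select_one("span.title").get_text(strip=True)
        raw_score = entry.select_one("span.score").get_text(strip=True)
        try:
            score = float(raw_score)
        except ValueError:
            score = None  # keep unparseable scores as missing values
        records.append({"title": title, "score": score})
    return records


def to_csv(records):
    """Serialize the records to CSV text (a file handle works the same way)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "score"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = parse_entries(SAMPLE_HTML)
csv_text = to_csv(records)
print(csv_text)
```

In a real crawl the HTML would come from an HTTP request (for example `requests.get(url).text`) instead of an inline string. Keeping the parsing in its own function makes it possible to check it without network access.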
The dataset used in this project was collected via web scraping from MyAnimeList. The data includes various attributes such as:
- Titles
- Genres
- Ratings
- Popularity
- Other metadata related to comics and manga.
To run the project, ensure the following libraries and tools are installed:
- Python 3.x
- Jupyter Notebook
- Libraries:
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - scikit-learn
  - beautifulsoup4
  - requests
- Clone this repository to your local machine.
- Install the required Python libraries.
- Open each notebook in Jupyter Notebook.
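- Run the notebooks in the order listed above (Data_Collecting, Data_Exploration_and_PreProcessing, Asking_Questions_and_Analyzing, Data_Modelling) to reproduce the results.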
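Before opening the notebooks, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical and not part of this repository. Note that two packages are imported under a different name than they are installed with: `scikit-learn` as `sklearn` and `beautifulsoup4` as `bs4`.

```python
from importlib.util import find_spec

# Map each pip package name from the requirements list to its import name.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "beautifulsoup4": "bs4",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```

The check only looks up whether each module can be located. It does not import the libraries or verify their versions.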