This is a blog about projects I work on to improve my programming and data analysis skills.
I am currently working towards a degree in Computer Science with Data Analytics. I previously worked in education and have taught in schools in the United States, Germany, South Korea and China. In addition to English, I speak German and survival-level Chinese. I am particularly interested in education, politics and issues related to social equity.
I have completed several data analysis projects as part of my degree.
- Fraud Detection: I used Python to clean and explore a dataset comprising information about credit card customers and their purchases. I used Association Rules Mining and Cluster Analysis to create profiles of customers who were more likely to experience credit card fraud. I then used Naive Bayes and Decision Trees to look for a correlation between fraudulent purchases and location.
- Health Predictions: I used Python to clean and explore the 2015 New York State Behavioral Risk Factor data. I designed an SQL database to store the data and queried it as needed for analysis. After exploring the data, I constructed research questions to guide the analysis. I then used Linear Regression, Naive Bayes and Hoeffding trees to look for connections between arthritis severity, doctor’s advice and exercise.
- Diabetic Correlations: I used Python to clean and explore data related to diabetes and hospital admissions. I designed research questions that could be answered using the given data. I then used Linear Regression, Logistic Regression, Random Forest and SMO support vector machines to examine the connections between diabetes, medications and hospital readmissions.
- Health Code Violations: I used Python to create a GUI that analyzed data and displayed visualizations showing the relationship between restaurant health code violations and ZIP code. I used MySQL to store and manipulate the data.
- Topic Modelling Podcasts: I used Python to clean and prepare podcast transcripts for analysis. I applied several topic modelling algorithms to explore which topics were most prevalent in podcasts. I used different visualization methods, such as word clouds and graphs of the distance between topics, to investigate how topic modelling might be used to identify podcasts related to specific needs in education, advertising and horizon scanning. The topics were also used to support the creation of a podcast search engine.
In addition, I have completed one personal project and am working on a second one.
- Making Movies Successful: I combined movie data from Kaggle with profit information obtained using the TMDB API. I used Python to clean and analyze the data. I found a connection between content ratings, genre and movie profits, as well as between profits and the number of movies the actors had previously appeared in.
- Use of Online Platforms in the Time of Covid: This is the project I am currently working on. I am looking at how the use of online learning platforms changed during 2020. I am also investigating correlations between the use of online platforms and other factors such as school culture, the percentage of students receiving free and reduced lunch, and achievement on standardized tests. I used Python to clean the data and a MySQL database to store it. I analysed the data using statistics, linear regression and association rules mining.
- Coin Sorter: I used Java, JavaFX and CSS to build a GUI that asks the user to enter an amount in cents. It then calculates the possible options for currency exchange as well as the number of bills and coins that the user can choose to have returned. This project also involved defining and implementing test cases to verify that the program worked as intended.
- Client_Server: I used Java to simulate a client and server running the stop-and-wait and Go-Back-N network protocols.
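As a rough illustration of the Coin Sorter calculation described above, the sketch below shows one way to break an amount in cents into bills and coins. It is not the project's actual code: the class name `CoinSorterSketch`, the method `breakdown` and the choice of US denominations are assumptions made for this example. It uses a simple largest-first (greedy) approach, which produces only one of the possible exchange options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CoinSorterSketch {
    // Assumed denominations in cents, largest first (hypothetical choice for this sketch).
    private static final int[] VALUES = {100, 25, 10, 5, 1};
    private static final String[] NAMES = {"dollar bill", "quarter", "dime", "nickel", "penny"};

    // Returns how many of each denomination make up the given amount,
    // always taking as many of the largest remaining denomination as possible.
    static Map<String, Integer> breakdown(int cents) {
        if (cents < 0) {
            throw new IllegalArgumentException("Amount must not be negative");
        }
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps largest-first order
        int remaining = cents;
        for (int i = 0; i < VALUES.length; i++) {
            counts.put(NAMES[i], remaining / VALUES[i]);
            remaining %= VALUES[i];
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {dollar bill=2, quarter=3, dime=1, nickel=0, penny=4}
        System.out.println(breakdown(289));
    }
}
```

Test cases for a program like this could check boundary values such as 0 cents, a single penny, and amounts that fall exactly on a denomination.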
- Java: object-oriented programming
- Python: scikit-learn, pandas, NumPy, Matplotlib, spaCy, NLTK
- SQL
- Database design
- Machine Learning: Clustering, Association Rules Mining, Classification, Topic Modelling, Choosing appropriate models
- R: tidyverse, arules