Data Science with Kaggle Syllabus Spring 2017

Introduction:

Welcome to Data Science with Kaggle! Kaggle hosts a large collection of company-contributed data sets and invites data scientists from around the world to solve the accompanying challenges, many of which are business-related. The platform fosters knowledge sharing, competition, and practical relevance, so beginners and experts alike benefit from a rapidly growing field.

Prerequisites:

This is a project-based class with a machine learning emphasis. You are expected to have some programming or statistics background, so the material will be of greatest benefit to sophomores or those who have taken CS61A, DATA 8, STAT 133, or an equivalent course. However, the first two weeks of class will be an optional Python bootcamp for those taking the course with no programming background. By the end of the bootcamp, you can determine whether you are comfortable continuing with the course.

Note that this is not an easy class. The student facilitators intend to provide you with a comprehensive guide to data analysis, with the goal of preparing you for industry and, if you demonstrate strong interest, for future machine learning competitions.

Learning objectives:

Listed below is a subset of topics you will learn from this course:

  1. Python programming

  2. Data interpretation, data munging, and visual analysis

  3. Numerical and text analysis

  4. Linear and Logistic Regression

  5. Clustering Techniques

  6. Random Forests

  7. Keras

  8. Neural Networks

  9. AWS for Machine Learning

Project Schedule:

Each project is designed to last two weeks. Projects can be completed individually or in teams of at most 4 students.

The first project will guide students through the intuition and steps of data analysis. The last two projects will be more open-ended, leaving students to work out their own approach.

  • The first two weeks will involve analysis of the Titanic data set, a widely studied and perhaps the most popular data set on Kaggle. We will guide you through the Python language and teach you basic analysis techniques.

  • MNIST Digit Recognizer – Clustering Techniques

  • Auto-Librarian Classification – Random Forests, text analysis, and clustering techniques

  • Company-sponsored In Class Kaggle Competition - All data science methods

Assignment Schedule:

Early assignments (during the bootcamp and possibly the first week of class) will consist of handwritten conceptual questions to check for understanding.

Afterwards, assignments will take the form of in-class Kaggle competitions in which students submit their model predictions on a custom data set separate from the one used in lecture. This will give you a chance to apply what you have learned in class. These assignments should be done individually.
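To make the submission step concrete, here is a minimal sketch of writing model predictions to a CSV file of the kind a Kaggle competition typically accepts. The column names `Id` and `Prediction`, the helper `write_submission`, and the sample ids and labels are all hypothetical; the actual format for each assignment will be whatever that competition's page specifies.

```python
import csv
import io


def write_submission(ids, predictions, out, id_col="Id", target_col="Prediction"):
    """Write a header row, then one row per test example (id, predicted label).

    The column names are placeholders, not the format of any real assignment.
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_col, target_col])
    for row_id, pred in zip(ids, predictions):
        writer.writerow([row_id, pred])


# Made-up ids and labels, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_submission([892, 893, 894], [0, 1, 0], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write to disk instead, pass a file opened with `open("submission.csv", "w", newline="")` in place of the buffer.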

Class Logistics:

The first two weeks will be an optional programming bootcamp that introduces students to programming and brings them up to the skill level needed to complete the course.

There will be 3 hours of lecture per week in addition to office hours. Assignments will be given at the beginning of each module and will be due at the end of that module. Note that the first few weeks of class will involve handwritten weekly assignments that test for understanding only.

Grading:

Early assignments will be graded on completeness. In-class Kaggle assignments will be graded on whether the prediction score is above a certain threshold. Final projects will be graded on completeness, quality of response, accuracy, and teammate evaluations. There will be 2 open-ended projects in the latter half of the semester. All assignments, projects, and attendance will be assigned a point value, with a general weighting scheme of 40% assignments and 60% projects. In order to pass the class, you must meet the minimum accuracy threshold for all assignments and projects.
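As a rough illustration of the weighting, the sketch below combines an assignment score and a project score using the 40%/60% split and checks a set of scores against thresholds. The function names, the example scores, and the threshold values are hypothetical, and the sketch leaves out attendance points because the syllabus does not say how they enter the total.

```python
ASSIGNMENT_WEIGHT = 0.40  # from the syllabus weighting scheme
PROJECT_WEIGHT = 0.60     # from the syllabus weighting scheme


def weighted_score(assignment_fraction, project_fraction):
    """Combine two scores, each given as a fraction between 0 and 1."""
    for value in (assignment_fraction, project_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    return ASSIGNMENT_WEIGHT * assignment_fraction + PROJECT_WEIGHT * project_fraction


def meets_all_thresholds(scores, thresholds):
    """Return True only if every score reaches its corresponding threshold."""
    if len(scores) != len(thresholds):
        raise ValueError("scores and thresholds must have the same length")
    return all(score >= threshold for score, threshold in zip(scores, thresholds))


# Made-up example: 90% on assignments and 80% on projects.
example = weighted_score(0.90, 0.80)
print(f"{example:.2f}")  # 0.84
```

With these made-up numbers, the assignments contribute 0.36 and the projects 0.48, for a combined 0.84.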

Suggested Online Reading Schedule:

These are hand-picked resources that the student instructors strongly believe will help you understand the lectures.

Docker Instructions

2/20 - https://www.kaggle.com/c/titanic/details/getting-started-with-python

2/27 - https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii

3/8 - http://opencvpython.blogspot.com/2012/12/k-means-clustering-1-basic-understanding.html

3/13 - https://www.dataquest.io/blog/k-nearest-neighbors-in-python/

3/15 - https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

3/22 - http://blog.yhat.com/posts/random-forests-in-python.html

4/3 - http://natureofcode.com/book/chapter-10-neural-networks/ (Introduction and Section 10.2)

4/5 - http://neuralnetworksanddeeplearning.com/chap2.html

4/10 - https://keras.io/getting-started/sequential-model-guide/

4/12 - http://www.asimovinstitute.org/neural-network-zoo/

Recommended Texts and Online Readings:

Most material will be in the form of PowerPoint slides, handouts, and live demos. However, there are a few resources we recommend reading throughout the course to better understand a concept or a programming language. Some are simply fun reads.

The Data Science Handbook by Carl Shan, Henry Wang, William Chen, and Max Song: http://www.thedatasciencehandbook.com

Python for Data Analysis by Wes McKinney: http://shop.oreilly.com/product/0636920023784.do

The Signal and the Noise by Nate Silver: https://en.wikipedia.org/wiki/The_Signal_and_the_Noise

Book for nitty-gritty of neural networks: http://neuralnetworksanddeeplearning.com/

Recurrent Neural Networks:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Extra Credit:

Students will be able to receive extra credit for completing a side project with instructor approval.

Attendance:

Since this is a project- and team-based class, attendance is mandatory. We will take attendance at the beginning of each class. However, you may have two absences for any reason. If you are working with a team, please keep your teammates informed.

Class Schedule:

2/6 - DeCal Kickoff. Overview of the tools we will use. This lecture will help you determine whether you need to attend the bootcamp.

2/8 - (Optional) Python setup. Coding environment setup. Variables/Data Types. If, for, while statements.

2/13 - (Optional) Data Visualization. Reading data. Structure of arrays/matrices/dataframes. Objects. Histograms and missing value imputation. Summary statistics.

2/15 - (Optional) Difference between Classification and Regression. Hands-on data exploration.

2/20 - President's Day: No Class

2/22 - Data cleaning. Regular expressions.

2/27 - Linear Regression. Assumptions for intuition. Residuals. Interpretation. Practical example on Titanic. Example of linear dependence.

3/1 - Logistic Regression. Assumptions for intuition. Difference from Linear Regression. Practical example on Titanic. Regression assignment due 3/5.

Project 1 released. Due March 17th

3/6 - Introduction to the Digit Recognition data set. Linear and logistic regression. K-nearest neighbors (KNN) classification.

3/8 - K-means and Cross-Validation. Sign up for AWS Educate.

3/13 - In-class implementation of KNN. Bias and variance tradeoff.

3/15 - Decision Trees.

3/20 - Random Forests.

3/22 - Lecture on Amazon Web Services (AWS).

3/27 - Spring Recess: No Class

3/29 - Spring Recess: No Class

4/3 - Introduction to Neural Networks

4/5 - How Neural Networks Learn / Gradient Descent

4/10 - Introduction to Convolutional Neural Nets through Keras

4/12 - Recurrent Neural Networks. Text recurrence.

4/17 - Anomaly Detection. Using LSTMs.

4/19 - Guest Lectures

4/24 - Guest Lectures

4/26 - Final Lecture. Review.