Coding for Data Analysis with R

Introduction to Data Analysis with R - lecture materials by Ágoston Reguly (CEU) with Gábor Békés (CEU, KRTK, CEPR)

This course material is a supplement to Data Analysis for Business, Economics, and Policy by Gábor Békés (CEU) and Gábor Kézdi (U. Michigan), Cambridge University Press, 2021.

Textbook information: see the textbook's website gabors-data-analysis.com or visit Cambridge University Press

To get a copy: Inspection copy for instructors or buy from Amazon or order online around the globe

Acknowledgments

We thank the CEU Department of Economics and Business for financial support.

Status

This is version 0.2. (2022-07-11)

Comments are really welcome in email or as a GitHub issue.

Overview

The course serves as an introduction to the R programming language and software environment for data exploration, data munging, data visualization, reporting, and modeling.

Lectures 1 to 11 complement Part I: Data Exploration (Chapters 1-6). They focus on basic programming principles, data structures, data cleaning, data exploration with descriptives and graphs, and simple hypothesis testing. This is an intro package for learning R and using it for exploration and some basic analysis.

Lectures 12 to 20 complement Part II: Regression Analysis (Chapters 7-12). They focus on statistical methods such as nonparametric regression, simple and multiple linear regression on cross-sectional data, binary models, and simple time-series analysis, while adding a more advanced toolkit for visualization and reporting. This is a regression-focused package with advanced features for analysis, including markdown.

Lectures 21 to 27 complement Part III: Prediction (Chapters 13-18). These lectures are not intended to be part of an introductory R course, but rather a more advanced seminar to support Data Analysis with machine learning tools for prediction. In this seminar-style course, students cover topics such as model selection with cross-validation, LASSO, RIDGE or Elastic Net regularization, regression trees with CART, random forest, and boosting. These methods are applied to cross-sectional data, mainly with continuous outcomes, and also to binary outcomes to model probabilities and handle classification problems. Long-run and short-run time series modeling, including ARIMA and VAR models, is also covered. To properly understand this material, students should first complete coding lectures 1 to 19.

Teaching philosophy

We believe students will learn using R by writing scripts and solving problems on their own. We provide and show them good practices on how to carry out such tasks, but extensive usage is needed.

This is not a hardcore coding course, but a course to supplement data analysis. The material focuses on specific issues in this topic and balances between higher levels of coding such as tidyverse -- which is more intuitive and easier to learn, but less flexible -- and lower levels in the form of basic coding principles -- which allow greater complexity and deeper understanding, but require much more practice and have a steeper learning curve.
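To make the contrast between the two levels concrete, here is a minimal sketch that computes the same group means twice: once with a dplyr pipeline and once with base R indexing and an explicit loop. The `hotels` data frame and all variable names are hypothetical toy values, not one of the textbook datasets, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration only).
hotels <- data.frame(
  city  = c("Vienna", "Vienna", "London", "London", "London"),
  price = c(80, 120, 150, 210, 90),
  stringsAsFactors = FALSE
)

# Higher-level (tidyverse) version: filter, group, and summarise in one pipeline.
tidy_means <- hotels %>%
  filter(price >= 90) %>%
  group_by(city) %>%
  summarise(mean_price = mean(price))

# Lower-level (base R) version: logical indexing plus an explicit for loop.
kept <- hotels[hotels$price >= 90, ]
cities <- sort(unique(kept$city))
base_means <- numeric(length(cities))
names(base_means) <- cities
for (ct in cities) {
  base_means[ct] <- mean(kept$price[kept$city == ct])
}

print(tidy_means)
print(base_means)
```

Both versions return the same numbers. The pipeline reads closer to a sentence, while the loop exposes the indexing and iteration steps that the pipeline hides.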

The material structure reflects these principles. The majority of the lectures have pre-written code, which includes in-class tasks for practice and problem solving, along with regular homework. This enables the instructor to show a greater variety of code, good coding examples, and far more commands and functions than live coding would, while leaving room for practice. For this type of lecture, homework is essential, as it helps students deepen their coding skills. There are also a few live-coding lectures, which require flexibility and more preparation from the teacher (the material provides detailed instructions). These lectures focus on basic coding principles such as the introduction to coding, functions, loops, conditionals, etc., and show students possible paths to hardcore coding, while showing alternative methods as well. The exceptions are lectures 21-27: they are intended as seminar material to support theory and assume a good level of coding. They have no homework or in-class tasks.

Whether solutions to the tasks or homework should be made available to students is always a good question. We believe showing students the in-class solutions is beneficial and does not distort motivation, as slower learners may want to revise and compare the suggested solution to their own. Hence, for each lecture, we provide the solutions for these tasks. However, this is not the case for the homework. We found that showing homework solutions tends to depress students' motivation and creativity, so there are no solutions for the homework. (It is important that there are (infinitely) many good solutions for a homework, so we usually encourage students to try out different paths as well.)

How to use

This course material may be used as a basis for a course on learning coding with R for the purpose of analyzing data. It is developed to be taught simultaneously with the textbook but may be used independently. It is rather comprehensive and thus, may be used without any textbook to prepare.

We have not reinvented the coding wheel. Instead, we tried to adopt best practices and combine them with real-life case studies from the textbook.

There are no slides, but the code is heavily commented, so it should be easy to follow. In some cases, it is beneficial (but not necessary) to read the related case study and/or chapter to fully appreciate the code and comments.

Each lecture includes an estimate of the time it needs, with suggestions on how to shorten it if it would be too long. The lectures -- on purpose -- contain more material than a classical 100-minute class per week for 12 weeks can cover. It is always easier to cut material than to add to it, and the taste of each instructor and/or class may differ. We highly encourage you to use each lecture as a starting point and modify it accordingly. Later, we propose an example course for this setting of one 100-minute class per week for a semester (12 weeks).

Sources

The material is based on multiple years of teaching coding courses at Central European University as well as advice from many many great resources such as

and many others, listed in the lecture's READMEs.

Lectures, learning outcomes, and case-studies

The following table shows a brief summary of the lectures: what is the type of the lecture, what is the expected learning outcome, and how it relates to the textbook's case studies and datasets.

| Lecture | Lecture type | Learning outcomes | Case-study | Dataset |
|---|---|---|---|---|
| PART I. | | | | |
| lecture00-intro | live coding or pre-written | Setting up R and RStudio. Introduction to the interface of RStudio. Packages and tryout of tidyverse and knitting a pre-written RMarkdown. | - | - |
| lecture01-coding-basics | live coding | Introduction to coding with R: R-objects, basic operations, functions, vectors, lists. | - | - |
| lecture02-data-imp-n-exp | pre-written | How to import and export data with readr and APIs. | - | hotels-vienna, football** |
| lecture03-tibbles | pre-written | Introduces tibbles as data variables. Selecting, adding or removing rows (observations) and columns (variables). Converting to wide and long format. Merging two tibbles in multiple ways. | Ch 02C: Football Managers | football |
| lecture04-data-munging | pre-written | Intro to data munging with dplyr: add, remove, separate, convert variables, filter observations, etc. | Ch 02A: Hotels prep* | hotels-europe |
| lecture05-data-exploration | pre-written | Intro to data exploration: modelsummary for descriptive stats in various ways, ggplot2 to plot one-variable distributions (histogram, density) and two-variable associations (scatter, bin-scatter), t.test for simple hypothesis testing. | Core: Ch06A: Online vs offline prices. Related: Ch03A: Hotels: exploration, Ch04A: Management & firm size | billion-prices, wms-management-survey** |
| lecture06-rmarkdown101 | pre-written | Intro to RMarkdown: knitting PDF and HTML. Structure of RMarkdown, formatting text, plots and tables. | Ch06A: Online vs offline prices* | billion-prices, hotels-europe** |
| lecture07-ggplot-indepth | pre-written | Tools to customize ggplot2 graphs. Write your own theme. Bar charts, box and violin plots. theme_bg() and source() from file and URL. | Ch03B: Hotels: Vienna vs London | hotels-europe |
| lecture08-conditionals | live coding | Conditional programming: if-else statements, logical operations with vectors, creating new variables with conditionals. | - | wms-management |
| lecture09-loops | live coding | Imperative programming with for and while loops. Exercise to calculate yearly sp500 returns. | Ch05A: Loss on stock portfolio | sp500 |
| lecture10-random-numbers | live coding | Introduction to random number generators and random sampling. | Ch03D: Height and income, Ch05A: Loss on a stock portfolio* | height-income-distributions, sp500 |
| lecture11-functions | live coding | Writing functions: control for input(s) and output(s), error handling. User-written confidence intervals, sampling distribution for t-statistics, bootstrapping. | Ch05A: Loss on a stock portfolio?*, Good-to-know: Ch06A: Online vs offline prices and Ch06B: Testing loss on a stock portfolio | wms-management, sp500 |
| PART II. | | | | |
| lecture12-intro-to-regression | pre-written | Intro to regressions: binary means, binscatters, non-parametric regression via lowess, simple linear regression. Predicted values and residuals. | Ch07A: Hotels with simple regression | hotels-vienna |
| lecture13-feature-engineering | pre-written | Intro to feature engineering. Covers variable transformations/manipulations used in the book, the case studies, and this R course. Can be skipped, but gives a good overview. | Ch01C: Data collection, Ch04A: Management & firm size*, Ch08C: Measurement error as HW, Ch17A: Predicting firm exit* | wms-management-survey, bisnode-firms, hotels-vienna** |
| lecture14-simple-regression | live coding | Level-level, log-level, level-log, log-log, polynomial and linear spline transformations for simple regressions. Weighted OLS. Graphical representation of these models. Model comparison, theory-based and statistics-based decisions for model choice. | Ch08B: Life expectancy, Ch08A: Hotels with non-linear as HW | worldbank-lifeexpectancy, hotels-vienna** |
| lecture15-advanced-linear-regression | pre-written | Introduction to multiple variable regression. Model evaluation: R2, prediction and error analysis with graphs. Confidence and prediction intervals. Robustness tests: checking parameter stability across time/location/type of observation. | Ch09B: Hotel stability, Ch10B: Hotels with multiple regression | hotels-europe |
| lecture16-binary-models | pre-written | Introduction to binary outcome models: saturated models, linear probability models, logit and probit models. Estimating average marginal effects for non-linear models via marginaleffects and summarizing with modelsummary. Evaluating models by R2, Pseudo-R2, Brier score and Log-loss. Comparison of predicted probabilities for certain groups and the distribution for different models. Bias of the model and calibration curve. | Ch11A: Smoking health risk | share-health |
| lecture17-dates-n-times | pre-written | Introduction to basic date and time variable manipulations. lubridate and rounding, differencing. Dataset aggregation, differenced and lagged variables, unit root tests. Visualizing time series. | Ch12A: Returns: company vs market** | stocks-sp500 |
| lecture18-timeseries-regression | pre-written | Introduction to time series analysis. Time-series data manipulations, simple visualizations and (partial) autocorrelation graph. Differencing, lags of outcome and explanatory variables, and deterministic seasonality. Using Newey-West standard errors. Model comparison and estimating cumulative effects with valid SEs. | Ch12B: Electricity and temperature | arizona-electricity, case-shiller-la** |
| lecture19-advaced-rmarkdown | pre-written | RMarkdown formatting for a data analysis report. Chunks, general and local set-options, formatting figures, descriptive tables and model comparison tables. Equations, Greek letters and hypothesis testing. Organizing the appendix. | Ch10A: Gender wage gap | cps-earnings |
| lecture20-basic-spatial-vizz | pre-written | Introduction to spatial visualization via maps (package-based maps) and rgdal (user-supplied maps). How to create a world map and show life expectancy, or color the average hotel prices for London boroughs or Vienna districts. Handling maps via geom_polygon and setting the scaling, colors, etc. | Ch08B: Life expectancy*, Ch03B: Compare hotel prices Vienna vs London* | worldbank-lifeexpectancy, hotels-europe |
| PART III. | | | | |
| lecture21-cross-validation | seminar | Model comparison introduced by BIC and RMSE. Limitations of these comparisons. Cross-validation: using different samples to tackle overfitting. The caret package. | Ch13A Predicting used car value with linear regressions and Ch14A Predicting used car value: log prices | used-cars |
| lecture22-lasso | seminar | Feature engineering for LASSO: interactions and polynomials. Cross-validation in detail. LASSO (and RIDGE, Elastic Net) via glmnet. Training-test samples and the holdout sample to evaluate predictions. LASSO diagnostics. | Ch14B Predicting AirBnB apartment prices: selecting a regression model | airbnb |
| lecture23-regression-tree | seminar | Estimating regression trees via rpart. Understanding regression trees and comparing them to linear regressions. Tuning and setup of CART. Tree and variable importance plots. | Ch15A Predicting used car value with regression trees | used-cars |
| lecture24-random-forest | seminar | Data cleaning and feature engineering specifics for random forest (RF). Estimating RFs via ranger. Examining the results of RFs with variable importance plots and partial dependence plots, and checking the quality of predictions in (important) subgroups. Gradient Boosting Method (GBM) via the gbm package. Prediction comparisons (prediction horse-race) for OLS, LASSO, CART, RF, and GBM. | Ch16A Predicting apartment prices with random forest | airbnb |
| lecture25-classification-wML | seminar | Predicting probabilities and classification with machine learning tools. Cross-validated logit models. LASSO with logit, CART, and Random Forest (bonus: why not use Classification Forest). Classification of probabilities, ROC curve, and AUC. Confusion matrix. Model comparison via RMSE or AUC. User-defined loss function to weight false-positive and false-negative rates. Optimizing the threshold value for classification to get the best loss function value. | Ch17A Predicting firm exit: probability and classification | bisnode-firms |
| lecture26-long-term-time-series-wML | seminar | Forecasting time series data over the long run. Feature engineering with time series, deciding on transformations for stationarity. Cross-validation options with time series. Modeling with deterministic trend, seasonality and other dummy variables for a long-term horizon. Evaluation of model and forecast precision. prophet as a machine learning tool for time series data. | Ch18A Forecasting daily ticket sales for a swimming pool | swim-transactions |
| lecture27-short-term-time-series-ARIMA-VAR | seminar | Forecasting time series data over the short run. Feature engineering with time series, deciding on transformations for stationarity. Cross-validation options with time series. ARIMA and VAR models for short-term forecasting. Evaluation of short-run forecasts: performance on the holdout set, fan chart to assess risks, and stability of forecasting performance over an extended time period. | Ch18B Forecasting a house price index | case-shiller-la |

*case study was the base for the material, but coding material is modified

**only used in homework

Folder structure within lectures

Within each lecture there is the following folder structure:

  • raw_codes: includes codes, which are ready to use during the course but require some live coding in class.
  • complete_codes: includes codes with suggested solutions to codes in raw_codes
  • data: in some cases, there is a data folder, which includes data files (typically in '.csv'). I have found it crucial during live-coding classes to make sure everybody has the same data.
  • if there are no folders then:
    • lecture has a notebook format, which implies a complete live-coding class (mostly introduction or technical "hard-core coding" lectures)
    • lecture has a complete R-script. In this case, the lecturer should pay attention to the interpretation of the material itself rather than to coding. Typically this is for more advanced case studies (chapters 13-18), where there is no new coding technique, but interpreting the results might be challenging.

Learning outcomes and relation to the book

Probably the largest difference from the book is that data handling is the most challenging and most time-consuming part of coding, while it is a relatively small (but equally important!) part of the book. It is always a challenge to keep up with the material if the two courses (Data Analysis and Coding) run in parallel. Experience shows that lecture05-data-exploration is the first truly common point with the book, and lecture06-rmarkdown101 enables students to submit data analysis material as PDF or HTML. This coding material was developed to catch up with the book as quickly as possible, showing the truly essential tools for handling data in an easy way. The result is that after 6 lectures in both courses (teaching Part I of the book) there is room for a common assignment in the form of a descriptive analysis: e.g. carry out a data-collection exercise, clean the data, and do exploratory analysis. The 'cost' is that, apart from some references or homework, there is no true connection between the two courses before lecture05-data-exploration, and students' data handling skills still have room to improve. Therefore, do not expect students to be able to solve (all) of the data exercises from the book (however, there were some positive surprises over the years).

In contrast, Part II of the book deals with regressions of various forms. This is fairly simple from the coding perspective, which allows the lecturer to

  1. deepen students' knowledge of basic coding principles;
  2. add further data handling practices to students' toolkit; and
  3. provide more skills in RMarkdown, while following the material of the book.

If the material is properly taught, there is no need for an extra coding course for Part III of the book, only a simple seminar-type supplement that puts emphasis on the interpretation and practice of machine learning methods. This material is provided in the folder part-III-case-studies. In principle, after these materials, students should be able to code by themselves and to understand and work with the case study materials related to Part IV.

Case studies and coding lectures

Alternatively, one can relate each case study from the book to specific lectures.

| Chapter | Case-study | Lecture |
|---|---|---|
| Chapter 1 | ch01-hotels-data-collect | lecture03-tibbles** |
| Chapter 2 | ch02-football-manager-success | lecture03-tibbles* |
| | ch02-hotels-data-prep | lecture04-data-munging |
| | ch02-immunization-crosscountry | lecture04-data-munging** |
| Chapter 3 | ch03-city-size-japan | lecture05-data-exploration** |
| | ch03-distributions-height-income | lecture05-data-exploration** |
| | ch03-football-home-advantage | lecture05-data-exploration** |
| | ch03-hotels-europe-compare | lecture05-data-exploration**, lecture07-ggplot-indepth |
| | ch03-hotels-vienna-explore | lecture05-data-exploration** |
| | ch03-simulations | lecture10-random-numbers |
| Chapter 4 | ch04-management-firm-size | lecture05-data-exploration**, lecture07-ggplot-indepth |
| Chapter 5 | ch05-stock-market-loss-generalize | lecture09-loops, lecture10-random-numbers, lecture11-functions |
| Chapter 6 | ch06-online-offline-price-test | lecture05-data-exploration*, lecture11-functions** |
| | ch06-stock-market-loss-test | lecture04-data-munging**, lecture11-functions* |
| Chapter 7 | ch07-hotels-simple-reg | lecture12-intro-to-regression |
| | ch07-ols-simulation | lecture12-intro-to-regression with lecture10-random-numbers |
| Chapter 8 | ch08-hotels-measurement-error | lecture13-feature-engineering |
| | ch08-hotels-nonlinear | lecture14-simple-regression** |
| | ch08-life-expectancy-income | lecture14-simple-regression |
| Chapter 9 | ch09-gender-age-earnings | lecture15-advanced-linear-regression** |
| | ch09-hotels-europe-stability | lecture15-advanced-linear-regression |
| Chapter 10 | ch10-gender-earnings-understand | lecture15-advanced-linear-regression**, lecture19-advaced-rmarkdown |
| | ch10-hotels-multiple-reg | lecture15-advanced-linear-regression |
| Chapter 11 | ch11-australia-rainfall-predict | lecture16-binary-models** |
| | ch11-smoking-health-risk | lecture16-binary-models |
| Chapter 12 | ch12-electricity-temperature | lecture18-timeseries-regression |
| | ch12-stock-returns-risk | lecture17-dates-n-times** |
| | ch12-time-series-simulations | All of the following**: lecture17-dates-n-times, lecture09-loops and lecture10-random-numbers |
| Chapter 13 | ch13-used-cars-reg | lecture21-cross-validation - first part |
| Chapter 14 | ch14-used-cars-log | lecture21-cross-validation - second part |
| | ch14-airbnb-reg | lecture22-lasso |
| Chapter 15 | ch15-used-cars-cart | lecture23-regression-tree |
| Chapter 16 | ch16-airbnb-random-forest | lecture24-random-forest |
| Chapter 17 | ch17-predicting-firm-exit | lecture25-classification-wML |
| Chapter 18 | ch18-swimmingpool | lecture26-long-term-time-series-wML |
| | ch18-case-shiller-la | lecture27-short-term-time-series-ARIMA-VAR |

*partial match: the case study is only used as a starting point for the lecture.

**students can understand and replicate material based on that lecture

Example course

As an example of a coding course that takes one 100-minute class per week for a semester (12 weeks), we have taught the following:

| Class | Lecture(s) | Comments |
|---|---|---|
| Class 01 | lecture00-intro, lecture01-coding-basics | Students are asked to install R, RStudio, and the tidyverse package and to knit an RMarkdown document before the class. Some coding-basics material (e.g. numeric vs integer vs double, indexing, or lists) is left out if I run out of time. |
| Class 02 | lecture02-data-imp-n-exp, lecture03-tibbles | Sometimes lecture03-tibbles is finished in the next class. |
| Class 03 | lecture04-data-munging, start: lecture05-data-exploration | Ask about RMarkdown knitting. |
| Class 04 | Finish: lecture05-data-exploration, lecture06-rmarkdown101 | At this point, assess whether students understand the basics of coding and make sure nobody is struggling. From this class on, they should be able to prepare a project for the 6th week's assessment, which is 2 weeks from this point. |
| Class 05 | lecture07-ggplot-indepth, lecture08-conditionals | This class provides some room for repetition or clarifying concepts. |
| Class 06 | lecture09-loops, lecture10-random-numbers and lecture11-functions | This should be a more relaxed class, as students have many (other) assessments around this time; concentrate more on the joy of programming. Many students may already know this material, so try to come up with some entertaining tasks for them as well. |
| Class 07 | lecture12-intro-to-regression, lecture13-feature-engineering | Feature engineering is new material, but fits here quite well. Class 07 should come after the first class of Part II, which discusses Chapter 7. |
| Class 08 | lecture14-simple-regression | Great opportunity for in-class (team) work for students with live coding. |
| Class 09 | lecture15-advanced-linear-regression | Make sure students have covered Chapter 10 from the book. If not, spatial data visualization is a great substitute here. |
| Class 10 | lecture16-binary-models | In some cases this material is covered in a seminar of the course that discusses Part II. This provides an opportunity to fill any gaps, or to make Class 12 less dense by jumping ahead to the next class's material. |
| Class 11 | lecture17-dates-n-times, lecture18-timeseries-regression | If short on time, skip lecture17-dates-n-times. |
| Class 12 | lecture19-advaced-rmarkdown, lecture20-basic-spatial-vizz | Two paths: discuss lecture19-advaced-rmarkdown in detail, including the whys, in which case there is no time for lecture20-basic-spatial-vizz; or stick to the technical details in both lectures, which makes it more likely to finish. |
| Class * | lecture20-basic-spatial-vizz | This lecture seldom fits into the timeframe of the course, especially if this coding class runs alongside theory classes for Parts I and II and serves as a supplement both in coding and in understanding the material. However, if there is a mismatch (e.g. the theory class is lagging behind), this class can be used flexibly as a substitute. |

Our decisions -- you may alter

  • Tidyverse and not data.table. Some friends love data.table. But it seems, tidyverse has become the more popular choice, especially at a starter level.
  • Starting with rm(list = ls()). Yes, we know. There is a strong view in favor of a project-based workflow: "If the first line of your R script is rm(list = ls()) I will come into your office and SET YOUR COMPUTER ON FIRE". We were warned directly, too. At the same time, for beginners, this seems a good start. So we kept it for lectures 01-20, not beyond. Feel free to use a version without it.
  • Descriptive tables are done with datasummary -- it takes a bit of time to get used to, but it is nice.
  • All regressions (except when we start) are run with fixest. We think it is the future regression command for all uses.
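For readers unfamiliar with the rm(list = ls()) debate, the following minimal sketch shows what the call does and does not reset. It uses a scratch environment (the hypothetical `workspace`) in place of the global workspace, so running it does not wipe your own session; all object names are assumptions for illustration.

```r
# A scratch environment stands in for the global workspace (hypothetical names).
workspace <- new.env()
assign("hotels", data.frame(price = c(80, 120)), envir = workspace)
assign("cutoff", 100, envir = workspace)

# A setting changed "earlier in the session"; options() returns the old value.
old_options <- options(digits = 4)

objects_before <- ls(envir = workspace)

# Same idea as rm(list = ls()) at the top of a script, applied to `workspace`.
rm(list = ls(envir = workspace), envir = workspace)

objects_after <- ls(envir = workspace)
digits_after  <- getOption("digits")

# Restore the option changed for the demonstration.
options(old_options)

print(objects_before)  # the two objects created above
print(objects_after)   # empty: all objects were removed
print(digits_after)    # still 4: options are not reset by rm()
```

The call removes objects but leaves options (and likewise attached packages and the working directory) untouched, which is the usual argument for restarting R in a fresh session instead. For a beginner course, clearing the objects is often enough.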

Our thanks

Thanks to all folks who contributed to the codebase for the course, especially Gábor Kézdi, co-author of the book. But also thanks to Zsuzsa Holler, Kinga Ritter, Ádám Víg, Jenő Pál, János Divényi, Marc Kaufmann, Gábors' and Ágoston's many students. Big thanks to Laurent Bergé, Grant McDermott and Vincent Arel-Bundock for awesome packages and all the help on coding over several years.

Found an error or have a suggestion?

Awesome, we know there are errors and bugs. Or just much better ways to do a procedure.

To make a suggestion, please open a GitHub issue here with a title containing the case study name. You may also contact us directly.