Only payments via credit card are left

pavelk2 · Sep 5, 2016 · 0215e8f
1 parent f5590f4

Showing 10 changed files with 46 additions and 44 deletions.
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-10-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-10-1.png
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-11-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-11-1.png
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-16-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-16-1.png
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-17-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-17-1.png
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-18-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-18-1.png
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-19-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-19-1.png
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-20-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-20-1.png
diff --git a/NYCtrips_files/figure-markdown_github/unnamed-chunk-6-1.png b/NYCtrips_files/figure-markdown_github/unnamed-chunk-6-1.png
diff --git a/README.md b/README.md
@@ -73,26 +73,26 @@ str(trips_short)
 ```
 
     ## 'data.frame':    250000 obs. of  20 variables:
-    ##  $ vendor_id         : Factor w/ 2 levels "1","2": 2 2 1 2 1 2 2 1 2 1 ...
-    ##  $ pickup_datetime   : POSIXlt, format: "2015-07-22 07:27:40" "2015-07-11 12:28:32" ...
-    ##  $ dropoff_datetime  : POSIXlt, format: "2015-07-22 07:41:14" "2015-07-11 12:46:30" ...
-    ##  $ passenger_count   : int  1 1 1 5 1 2 1 1 1 1 ...
-    ##  $ trip_distance     : num  3.08 5.7 6.9 1.51 2.1 1.92 3.11 7.2 0.59 1.9 ...
+    ##  $ vendor_id         : Factor w/ 2 levels "1","2": 2 2 1 1 2 2 2 1 2 1 ...
+    ##  $ pickup_datetime   : POSIXlt, format: "2015-07-16 22:03:44" "2015-07-03 13:39:46" ...
+    ##  $ dropoff_datetime  : POSIXlt, format: "2015-07-16 22:10:38" "2015-07-03 13:55:32" ...
+    ##  $ passenger_count   : int  1 2 1 1 4 1 1 1 5 2 ...
+    ##  $ trip_distance     : num  1.58 3.53 3.3 11.6 0.4 5.3 2.23 1.7 0.85 1.5 ...
     ##  $ pickup_longitude  : num  -74 -74 -74 -74 -74 ...
-    ##  $ pickup_latitude   : num  40.7 40.8 40.8 40.7 40.7 ...
+    ##  $ pickup_latitude   : num  40.7 40.7 40.8 40.7 40.8 ...
     ##  $ rate_code         : Factor w/ 7 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
     ##  $ store_and_fwd_flag: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
-    ##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
-    ##  $ dropoff_latitude  : num  40.8 40.7 40.7 40.7 40.7 ...
+    ##  $ dropoff_longitude : num  -74 -74 -74 -73.9 -74 ...
+    ##  $ dropoff_latitude  : num  40.8 40.8 40.8 40.8 40.8 ...
     ##  $ payment_type      : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
-    ##  $ fare_amount       : num  12 19 25 8 10 9 13.5 25.5 4.5 9 ...
-    ##  $ extra             : num  0 0 0 0.5 1 0 0 0 0 0 ...
+    ##  $ fare_amount       : num  7.5 14.5 14 43 4 19.5 9.5 9.5 5.5 11.5 ...
+    ##  $ extra             : num  0.5 0 0.5 0 0 0.5 0.5 1 0 0 ...
     ##  $ mta_tax           : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
-    ##  $ tip_amount        : num  1.2 2 7.74 1 2 1 3 2 1.06 2.9 ...
+    ##  $ tip_amount        : num  1.76 2.7 3.05 10 1.44 4.16 3.24 2 1.26 2 ...
     ##  $ tolls_amount      : num  0 0 0 0 0 0 0 0 0 0 ...
     ##  $ imp_surcharge     : num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
-    ##  $ total_amount      : num  14 21.8 33.5 10.3 13.8 ...
-    ##  $ tip_percentage    : num  9.37 10.1 30 10.75 16.95 ...
+    ##  $ total_amount      : num  10.56 18 18.35 53.8 6.24 ...
+    ##  $ tip_percentage    : num  20 17.6 19.9 22.8 30 ...
 
 5. Explore data
 ---------------
@@ -102,14 +102,14 @@ Let's see how our trips are spread geographically:
     ## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan+New+York&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
     ## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan%20New%20York&sensor=false
 
-    ## Warning: Removed 30 rows containing missing values (geom_point).
+    ## Warning: Removed 34 rows containing missing values (geom_point).
 
 ![](NYCtrips_files/figure-markdown_github/unnamed-chunk-6-1.png)
 
 Let's see what the tip percentages look like:
 
-    ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-    ##     0.017    16.950    20.000    19.640    20.000 29330.000
+    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
+    ##     0.01    17.05    20.00    19.80    20.00 29330.00
 
 The maximum value looks extreme. Let's see how extreme the top percentiles are:
 
@@ -118,7 +118,7 @@ quantile(trips_short$tip_percentage,c(0.95,0.99,0.999))
 ```
 
     ##       95%       99%     99.9% 
-    ##  29.64602  33.89831 100.00036
+    ##  29.67480  34.09091 103.17488
 
 To make our data more consistent we remove all records with tips greater than 100%:
 
@@ -154,23 +154,23 @@ Now let's see if there are any variables affecting the tip\_percentage people pa
     ## 
     ## Residuals:
     ##     Min      1Q  Median      3Q     Max 
-    ## -20.490  -2.228   0.695   1.178  82.103 
+    ## -19.987  -2.264   0.637   1.203  83.276 
     ## 
     ## Coefficients:
     ##                        Estimate Std. Error t value Pr(>|t|)    
-    ## (Intercept)          443.937612  34.687504  12.798  < 2e-16 ***
-    ## trip_distance         -0.109540   0.003753 -29.185  < 2e-16 ***
-    ## pickup_datetime$hour  -0.001396   0.001826  -0.764  0.44465    
-    ## pickup_datetime$yday   0.002651   0.001374   1.930  0.05365 .  
-    ## pickup_longitude       4.933525   0.380439  12.968  < 2e-16 ***
-    ## pickup_latitude       -1.473956   0.469020  -3.143  0.00167 ** 
-    ## passenger_count        0.047608   0.009046   5.263 1.42e-07 ***
+    ## (Intercept)          539.769443  36.050848  14.972  < 2e-16 ***
+    ## trip_distance         -0.129574   0.004189 -30.933  < 2e-16 ***
+    ## pickup_datetime$hour  -0.003530   0.001832  -1.927   0.0540 .  
+    ## pickup_datetime$yday  -0.001601   0.001378  -1.162   0.2453    
+    ## pickup_longitude       6.304339   0.393750  16.011  < 2e-16 ***
+    ## pickup_latitude       -1.313044   0.472955  -2.776   0.0055 ** 
+    ## passenger_count        0.036227   0.009083   3.989 6.65e-05 ***
     ## ---
     ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
     ## 
-    ## Residual standard error: 6.038 on 249743 degrees of freedom
-    ## Multiple R-squared:  0.003624,   Adjusted R-squared:  0.0036 
-    ## F-statistic: 151.4 on 6 and 249743 DF,  p-value: < 2.2e-16
+    ## Residual standard error: 6.063 on 249726 degrees of freedom
+    ## Multiple R-squared:  0.004049,   Adjusted R-squared:  0.004025 
+    ## F-statistic: 169.2 on 6 and 249726 DF,  p-value: < 2.2e-16
 
 R squared is very small, so this linear regression does not represent our dataset well, even though the slopes of some variables have statistically significant p-values.
 
@@ -192,19 +192,19 @@ zones <- group_data(trips_short_zip)
 str(zones)
 ```
 
-    ## 'data.frame':    207 obs. of  12 variables:
-    ##  $ zipcode                    : Factor w/ 207 levels "10001","10002",..: 1 2 3 4 5 6 7 8 9 10 ...
-    ##  $ lat                        : num  40.7 40.7 40.7 40.7 40.7 ...
+    ## 'data.frame':    210 obs. of  12 variables:
+    ##  $ zipcode                    : Factor w/ 210 levels "10001","10002",..: 1 2 3 4 5 6 7 8 9 10 ...
+    ##  $ lat                        : num  40.8 40.7 40.7 40.7 40.7 ...
     ##  $ lon                        : num  -74 -74 -74 -74 -74 ...
-    ##  $ amount                     : num  1449 5299 13373 1121 598 ...
-    ##  $ tip_percentage.mean        : num  19.4 18.7 19 18.3 19.3 ...
+    ##  $ amount                     : num  1401 5466 13326 1062 563 ...
+    ##  $ tip_percentage.mean        : num  19.7 18.8 19.1 18.6 18.5 ...
     ##  $ tip_percentage.median      : num  20 20 20 20 20 ...
-    ##  $ tip_amount.mean            : num  2.39 2.65 2.35 3.4 3.63 ...
-    ##  $ tip_amount.median          : num  1.96 2.16 1.96 2.95 3 2.15 2 2 1.86 1.95 ...
-    ##  $ tip_amount.mean_round      : Factor w/ 19 levels "0$","1$","10$",..: 9 12 9 12 13 12 9 9 9 9 ...
-    ##  $ tip_amount.median_round    : Factor w/ 19 levels "0$","1$","10$",..: 9 9 9 12 12 9 9 9 9 9 ...
-    ##  $ tip_percentage.median_round: Factor w/ 23 levels "10%","11%","12%",..: 11 11 11 11 11 11 11 11 11 11 ...
-    ##  $ tip_percentage.mean_round  : Factor w/ 25 levels "10%","11%","12%",..: 10 10 10 9 10 10 10 10 10 10 ...
+    ##  $ tip_amount.mean            : num  2.45 2.63 2.35 3.42 3.52 ...
+    ##  $ tip_amount.median          : num  2 2.19 1.96 2.95 3 2.15 1.96 2 1.89 1.95 ...
+    ##  $ tip_amount.mean_round      : Factor w/ 22 levels "0$","1$","10$",..: 10 14 10 14 17 14 10 10 10 10 ...
+    ##  $ tip_amount.median_round    : Factor w/ 23 levels "0$","1$","10$",..: 11 11 11 15 15 11 11 11 11 11 ...
+    ##  $ tip_percentage.median_round: Factor w/ 22 levels "10%","12%","15%",..: 9 9 9 9 9 9 9 9 9 9 ...
+    ##  $ tip_percentage.mean_round  : Factor w/ 25 levels "10%","12%","13%",..: 10 8 8 8 7 8 8 8 8 8 ...
 
 The fewer records we have for a given zone (the *amount* column), the more extreme its aggregated values tend to be compared with zones that have more records. For that reason our analysis considers only zones with more than 500 records (500 is an arbitrary threshold, chosen as a large enough number of records).
 
@@ -270,7 +270,7 @@ welch_test <- t.test(TimesSquare_10036$tip_percentage,WorldTradeCenter_10250$tip
 welch_test$p.value
 ```
 
-    ## [1] 4.849112e-10
+    ## [1] 1.810699e-08
 
 The Welch test shows that people starting from Times Square do tend to tip a higher percentage (p-value < 0.001). Let's check how much more they tend to pay:
 
@@ -280,9 +280,9 @@ cohen_distance$estimate
 ```
 
     ## Treatment 
-    ## 0.2102142
+    ## 0.1819663
 
-Cohen's d is 0.21.
+Cohen's d is 0.18.
 
 Let's compare two airports (JFK and LaGuardia):
 
@@ -291,7 +291,7 @@ welch_test <- t.test(LaGuardia_11371$tip_percentage,JFK_11430$tip_percentage,alt
 welch_test$p.value
 ```
 
-    ## [1] 1.134836e-18
+    ## [1] 2.992254e-18
 
 **People taking taxis from LaGuardia airport seem to tip more than people leaving from JFK (p-value < 0.001)**.
 
@@ -301,7 +301,7 @@ cohen_distance$estimate
 ```
 
     ## Treatment 
-    ## 0.1822718
+    ## 0.1807705
 
 Cohen's d is 0.18.
 
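The Welch test and Cohen's d comparison used in the README can be sketched on synthetic data. This is only an illustration under stated assumptions: `zone_a` and `zone_b` are made-up samples rather than the taxi data, `cohens_d` is a hypothetical helper that uses the pooled standard deviation (the README's own `cohen_distance` object may be computed differently), and the one-sided `alternative = "greater"` is assumed because the original call is truncated in the hunk header.

```r
# Synthetic stand-ins for two zones' tip percentages (NOT the taxi data).
set.seed(42)
zone_a <- rnorm(5000, mean = 20, sd = 6)
zone_b <- rnorm(5000, mean = 19, sd = 6)

# Welch's two-sample t-test: t.test() defaults to var.equal = FALSE.
# alternative = "greater" (an assumption) tests mean(zone_a) > mean(zone_b).
welch_test <- t.test(zone_a, zone_b, alternative = "greater")

# Hypothetical helper: Cohen's d based on the pooled standard deviation.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Print the p-value in scientific notation instead of rounding it to 0.
p_text <- format(welch_test$p.value, scientific = TRUE, digits = 3)
d_value <- cohens_d(zone_a, zone_b)
print(p_text)
print(round(d_value, 2))
```

With samples this large, even a small effect yields a tiny p-value, which is why the effect size is worth reporting next to the test result.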

diff --git a/data_preparation.r b/data_preparation.r
@@ -44,6 +44,8 @@ clean_data <- function(trips){
 	# remove records whose dropoff latitude is outside NYC (limits picked by hand from the NYC Google Map)
 	trips <- trips[trips$dropoff_latitude > NYC_region[2],]
 	trips <- trips[trips$dropoff_latitude < NYC_region[4],]
+	# keep only trips paid by credit card (payment_type == 1), as tip info is recorded only for those
+	trips <- trips[trips$payment_type == 1,]
 	# remove all trips where tip amount is 0 or negative (as we are interested in very high tips)
 	trips <- trips[trips$tip_amount > 0,]
 	# remove all trips where passenger count is 0
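
One thing to watch with the `payment_type == 1` filter is how logical row indexing treats missing values. The sketch below uses a toy data frame with made-up values. Whether the real `trips` data contains `NA` payment types is not known from this diff, so treat the `which()` variant as an optional safeguard, not as a required fix.

```r
# Toy data frame standing in for the trips data (values are made up).
trips <- data.frame(
  payment_type = factor(c("1", "2", NA, "1", "3")),
  tip_amount   = c(2.5, 0, 1.0, 0, 0)
)

# Logical indexing yields an all-NA row for every NA in the condition.
naive <- trips[trips$payment_type == 1, ]

# which() returns only the positions where the condition is TRUE,
# so rows with a missing payment_type are dropped.
card_only <- trips[which(trips$payment_type == "1"), ]

nrow(naive)      # 3: two card rows plus one all-NA row
nrow(card_only)  # 2
```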