Only payments via credit card are left
pavelk2 committed Sep 5, 2016
1 parent f5590f4 commit 0215e8f
Showing 10 changed files with 46 additions and 44 deletions.
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-10-1.png
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-11-1.png
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-16-1.png
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-17-1.png
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-18-1.png
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-19-1.png
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-20-1.png
Binary file modified NYCtrips_files/figure-markdown_github/unnamed-chunk-6-1.png
88 changes: 44 additions & 44 deletions README.md
@@ -73,26 +73,26 @@ str(trips_short)
```

## 'data.frame': 250000 obs. of 20 variables:
## $ vendor_id : Factor w/ 2 levels "1","2": 2 2 1 2 1 2 2 1 2 1 ...
## $ pickup_datetime : POSIXlt, format: "2015-07-22 07:27:40" "2015-07-11 12:28:32" ...
## $ dropoff_datetime : POSIXlt, format: "2015-07-22 07:41:14" "2015-07-11 12:46:30" ...
## $ passenger_count : int 1 1 1 5 1 2 1 1 1 1 ...
## $ trip_distance : num 3.08 5.7 6.9 1.51 2.1 1.92 3.11 7.2 0.59 1.9 ...
## $ vendor_id : Factor w/ 2 levels "1","2": 2 2 1 1 2 2 2 1 2 1 ...
## $ pickup_datetime : POSIXlt, format: "2015-07-16 22:03:44" "2015-07-03 13:39:46" ...
## $ dropoff_datetime : POSIXlt, format: "2015-07-16 22:10:38" "2015-07-03 13:55:32" ...
## $ passenger_count : int 1 2 1 1 4 1 1 1 5 2 ...
## $ trip_distance : num 1.58 3.53 3.3 11.6 0.4 5.3 2.23 1.7 0.85 1.5 ...
## $ pickup_longitude : num -74 -74 -74 -74 -74 ...
## $ pickup_latitude : num 40.7 40.8 40.8 40.7 40.7 ...
## $ pickup_latitude : num 40.7 40.7 40.8 40.7 40.8 ...
## $ rate_code : Factor w/ 7 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ store_and_fwd_flag: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.8 40.7 40.7 40.7 40.7 ...
## $ dropoff_longitude : num -74 -74 -74 -73.9 -74 ...
## $ dropoff_latitude : num 40.8 40.8 40.8 40.8 40.8 ...
## $ payment_type : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ fare_amount : num 12 19 25 8 10 9 13.5 25.5 4.5 9 ...
## $ extra : num 0 0 0 0.5 1 0 0 0 0 0 ...
## $ fare_amount : num 7.5 14.5 14 43 4 19.5 9.5 9.5 5.5 11.5 ...
## $ extra : num 0.5 0 0.5 0 0 0.5 0.5 1 0 0 ...
## $ mta_tax : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ tip_amount : num 1.2 2 7.74 1 2 1 3 2 1.06 2.9 ...
## $ tip_amount : num 1.76 2.7 3.05 10 1.44 4.16 3.24 2 1.26 2 ...
## $ tolls_amount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ imp_surcharge : num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 14 21.8 33.5 10.3 13.8 ...
## $ tip_percentage : num 9.37 10.1 30 10.75 16.95 ...
## $ total_amount : num 10.56 18 18.35 53.8 6.24 ...
## $ tip_percentage : num 20 17.6 19.9 22.8 30 ...

5. Explore data
---------------
@@ -102,14 +102,14 @@ Let's see how our trips are spread geographically:
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan+New+York&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan%20New%20York&sensor=false

## Warning: Removed 30 rows containing missing values (geom_point).
## Warning: Removed 34 rows containing missing values (geom_point).

![](NYCtrips_files/figure-markdown_github/unnamed-chunk-6-1.png)

Let's see what the tip percentages look like:

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.017 16.950 20.000 19.640 20.000 29330.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 17.05 20.00 19.80 20.00 29330.00

The maximum value looks extreme. Let's see how extreme the top percentiles are:

@@ -118,7 +118,7 @@ quantile(trips_short$tip_percentage,c(0.95,0.99,0.999))
```

## 95% 99% 99.9%
## 29.64602 33.89831 100.00036
## 29.67480 34.09091 103.17488

To make our data more consistent we remove all records with tips greater than 100%:
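
One way to write such a filter, as a minimal sketch on a toy data frame (`toy_trips` and its values are made up for illustration and are not the project's data):

```r
# Toy stand-in for trips_short; the values are made up for illustration.
toy_trips <- data.frame(tip_percentage = c(15, 20, 29330, 100, 250, 18.5))

# Keep only records with a tip of at most 100%.
# which() drops NA comparisons instead of turning them into all-NA rows.
toy_filtered <- toy_trips[which(toy_trips$tip_percentage <= 100), , drop = FALSE]

nrow(toy_filtered)
```

`drop = FALSE` keeps the result a data frame even though the toy example has a single column.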

@@ -154,23 +154,23 @@ Now let's see if there are any variables affecting the tip\_percentage people pa
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.490 -2.228 0.695 1.178 82.103
## -19.987 -2.264 0.637 1.203 83.276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 443.937612 34.687504 12.798 < 2e-16 ***
## trip_distance -0.109540 0.003753 -29.185 < 2e-16 ***
## pickup_datetime$hour -0.001396 0.001826 -0.764 0.44465
## pickup_datetime$yday 0.002651 0.001374 1.930 0.05365 .
## pickup_longitude 4.933525 0.380439 12.968 < 2e-16 ***
## pickup_latitude -1.473956 0.469020 -3.143 0.00167 **
## passenger_count 0.047608 0.009046 5.263 1.42e-07 ***
## (Intercept) 539.769443 36.050848 14.972 < 2e-16 ***
## trip_distance -0.129574 0.004189 -30.933 < 2e-16 ***
## pickup_datetime$hour -0.003530 0.001832 -1.927 0.0540 .
## pickup_datetime$yday -0.001601 0.001378 -1.162 0.2453
## pickup_longitude 6.304339 0.393750 16.011 < 2e-16 ***
## pickup_latitude -1.313044 0.472955 -2.776 0.0055 **
## passenger_count 0.036227 0.009083 3.989 6.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.038 on 249743 degrees of freedom
## Multiple R-squared: 0.003624, Adjusted R-squared: 0.0036
## F-statistic: 151.4 on 6 and 249743 DF, p-value: < 2.2e-16
## Residual standard error: 6.063 on 249726 degrees of freedom
## Multiple R-squared: 0.004049, Adjusted R-squared: 0.004025
## F-statistic: 169.2 on 6 and 249726 DF, p-value: < 2.2e-16

R squared is very small (about 0.004), so this linear regression does not represent our dataset well, even though the slopes of some variables have statistically significant p-values.
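
The combination of tiny p-values and a tiny R squared is typical of very large samples. A small simulation can illustrate the effect; everything in it (`n`, the slope of 0.05, the variable names) is an assumption chosen for the demonstration and not taken from the taxi data:

```r
set.seed(42)  # fixed seed so the simulation is repeatable

n <- 100000
x <- rnorm(n)
# Simulated response: a weak true slope (0.05) buried in unit-variance noise.
y <- 0.05 * x + rnorm(n)

fit <- summary(lm(y ~ x))
slope_p_value <- fit$coefficients["x", "Pr(>|t|)"]
r_squared <- fit$r.squared

slope_p_value  # very small: the slope is detectably non-zero
r_squared      # also very small: x explains almost none of the variance
```

With this many observations even a weak relationship is statistically detectable, so a significant slope by itself says little about how well the model predicts.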

@@ -192,19 +192,19 @@ zones <- group_data(trips_short_zip)
str(zones)
```

## 'data.frame': 207 obs. of 12 variables:
## $ zipcode : Factor w/ 207 levels "10001","10002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ lat : num 40.7 40.7 40.7 40.7 40.7 ...
## 'data.frame': 210 obs. of 12 variables:
## $ zipcode : Factor w/ 210 levels "10001","10002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ lat : num 40.8 40.7 40.7 40.7 40.7 ...
## $ lon : num -74 -74 -74 -74 -74 ...
## $ amount : num 1449 5299 13373 1121 598 ...
## $ tip_percentage.mean : num 19.4 18.7 19 18.3 19.3 ...
## $ amount : num 1401 5466 13326 1062 563 ...
## $ tip_percentage.mean : num 19.7 18.8 19.1 18.6 18.5 ...
## $ tip_percentage.median : num 20 20 20 20 20 ...
## $ tip_amount.mean : num 2.39 2.65 2.35 3.4 3.63 ...
## $ tip_amount.median : num 1.96 2.16 1.96 2.95 3 2.15 2 2 1.86 1.95 ...
## $ tip_amount.mean_round : Factor w/ 19 levels "0$","1$","10$",..: 9 12 9 12 13 12 9 9 9 9 ...
## $ tip_amount.median_round : Factor w/ 19 levels "0$","1$","10$",..: 9 9 9 12 12 9 9 9 9 9 ...
## $ tip_percentage.median_round: Factor w/ 23 levels "10%","11%","12%",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ tip_percentage.mean_round : Factor w/ 25 levels "10%","11%","12%",..: 10 10 10 9 10 10 10 10 10 10 ...
## $ tip_amount.mean : num 2.45 2.63 2.35 3.42 3.52 ...
## $ tip_amount.median : num 2 2.19 1.96 2.95 3 2.15 1.96 2 1.89 1.95 ...
## $ tip_amount.mean_round : Factor w/ 22 levels "0$","1$","10$",..: 10 14 10 14 17 14 10 10 10 10 ...
## $ tip_amount.median_round : Factor w/ 23 levels "0$","1$","10$",..: 11 11 11 15 15 11 11 11 11 11 ...
## $ tip_percentage.median_round: Factor w/ 22 levels "10%","12%","15%",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ tip_percentage.mean_round : Factor w/ 25 levels "10%","12%","13%",..: 10 8 8 8 7 8 8 8 8 8 ...

The fewer records we have for a given zone (its *amount*), the more extreme its aggregated values tend to be compared with zones that have more records. Because of that, our analysis considers only zones with more than 500 records (500 is an arbitrary choice of a large enough number of records).
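
A minimal sketch of that threshold on a toy data frame. The first five `amount` values are the ones shown in the `str(zones)` output above; the zone labels, the last two rows, and the names `toy_zones`, `min_records`, and `big_zones` are made up for illustration:

```r
# Toy stand-in for the zones data frame.
toy_zones <- data.frame(
  zone   = c("A", "B", "C", "D", "E", "F", "G"),
  amount = c(1401, 5466, 13326, 1062, 563, 12, 500)
)

# Keep only zones with strictly more than min_records records.
min_records <- 500
big_zones <- toy_zones[toy_zones$amount > min_records, ]

big_zones$zone
```

Keeping the threshold in a named variable makes it easy to check how sensitive the later comparisons are to this arbitrary choice.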

@@ -270,7 +270,7 @@ welch_test <- t.test(TimesSquare_10036$tip_percentage,WorldTradeCenter_10250$tip
welch_test$p.value
```

## [1] 4.849112e-10
## [1] 1.810699e-08

The Welch test suggests that people starting from Times Square tend to tip a higher percentage (p-value ≈ 1.8e-08, which is very small but not exactly 0). Let's check how much more they tend to pay:

@@ -280,9 +280,9 @@ cohen_distance$estimate
```

## Treatment
## 0.2102142
## 0.1819663

Cohen's d is 0.21.
Cohen's d is 0.18.
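
The README reads the effect size from `cohen_distance$estimate`; the call that produced that object is not visible in this diff. As a reference point, here is a base-R sketch of the usual pooled-standard-deviation definition of Cohen's d. The helper `cohens_d` is hypothetical and may differ in detail from whatever function the project uses:

```r
# Cohen's d with a pooled standard deviation (base R only).
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled SD: variances weighted by their degrees of freedom.
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Tiny worked example: means 4 and 3, both variances 4, pooled SD 2.
d_example <- cohens_d(c(2, 4, 6), c(1, 3, 5))
d_example  # 0.5
```

By Cohen's common rule of thumb, a d of about 0.2 counts as a small effect, so the differences reported here are statistically detectable but modest in size.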

Let's compare two airports (JFK and LaGuardia):

@@ -291,7 +291,7 @@ welch_test <- t.test(LaGuardia_11371$tip_percentage,JFK_11430$tip_percentage,alt
welch_test$p.value
```

## [1] 1.134836e-18
## [1] 2.992254e-18

**People taking taxis from LaGuardia airport seem to tip a higher percentage than people from JFK (p-value ≈ 3.0e-18)**.
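
The kind of comparison used for these zone pairs can be sketched on simulated data. The `alternative` argument of the original `t.test` calls is cut off in this diff, so `"greater"` below is an assumption, as are the simulated samples `zone_a` and `zone_b`:

```r
set.seed(1)  # fixed seed so the simulation is repeatable

# Simulated tip percentages for two hypothetical pickup zones
# with different sample sizes and variances.
zone_a <- rnorm(2000, mean = 21, sd = 6)
zone_b <- rnorm(1500, mean = 19, sd = 7)

# var.equal = FALSE (the default) gives Welch's unequal-variance t-test;
# alternative = "greater" tests whether mean(zone_a) > mean(zone_b).
welch <- t.test(zone_a, zone_b, alternative = "greater", var.equal = FALSE)

welch$method
# Print small p-values in scientific notation instead of rounding them to 0.
format(welch$p.value, digits = 3, scientific = TRUE)
```

Reporting the p-value in scientific notation avoids the misleading "p-value=0" shorthand: the value can be extremely small, but it is never exactly zero.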

@@ -301,7 +301,7 @@ cohen_distance$estimate
```

## Treatment
## 0.1822718
## 0.1807705

Cohen's d is 0.18.

2 changes: 2 additions & 0 deletions data_preparation.r
@@ -44,6 +44,8 @@ clean_data <- function(trips){
# remove records with latitudes outside NYC (the limits were picked by hand from a Google Map of NYC)
trips <- trips[trips$dropoff_latitude > NYC_region[2],]
trips <- trips[trips$dropoff_latitude < NYC_region[4],]
# keep only trips paid by credit card (payment_type == 1), as tip info is available only for those
trips <- trips[trips$payment_type == 1,]
# remove all trips where tip amount is 0 or negative (as we are interested in very high tips)
trips <- trips[trips$tip_amount > 0,]
# remove all trips where passenger count is 0
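
The credit-card and tip filters in `clean_data` use plain logical indexing. If `payment_type` ever contained `NA` values (this diff does not show whether it does), that style would keep all-`NA` rows. A hedged sketch of the difference on a made-up `toy_trips` data frame:

```r
# Toy stand-in for the trips data; the values are made up for illustration.
toy_trips <- data.frame(
  payment_type = factor(c("1", "2", NA, "1", "3")),
  tip_amount   = c(2.5, 0, 1.0, 0, 4.0)
)

# Plain logical indexing keeps an all-NA row for the NA payment_type.
plain <- toy_trips[toy_trips$payment_type == 1, ]
nrow(plain)  # 3: two credit-card rows plus one all-NA row

# which() keeps only the rows where the condition is TRUE.
card_trips  <- toy_trips[which(toy_trips$payment_type == 1), ]
card_tipped <- card_trips[which(card_trips$tip_amount > 0), ]
nrow(card_tipped)  # 1
```

Wrapping the condition in `which()` is one option; `subset()` or `dplyr::filter()` also drop rows where the condition evaluates to `NA`.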
