Add Notebooks for Python in the nyc taxi workshop #1

Status: Open. Wants to merge 50 commits into base: master.

Commits (50)
7673de9 - First commit (Jul 17, 2018)
f9f18a2 - Update: (Jul 17, 2018)
95ed995 - Update: (Jul 17, 2018)
a3d12db - Update: (Jul 17, 2018)
ec37378 - Update: (Jul 17, 2018)
c6a86a5 - Clean-up (Jul 17, 2018)
9d2c40a - Clean-up (Jul 17, 2018)
ff69c3a - Removed WIP (Jul 18, 2018)
9e41220 - init commit (realAngryAnalytics, Aug 19, 2018)
357e2a6 - init commit (realAngryAnalytics, Aug 19, 2018)
756bd41 - init commit (realAngryAnalytics, Aug 19, 2018)
630caf5 - init commit (realAngryAnalytics, Aug 19, 2018)
f49522b - init commit (realAngryAnalytics, Aug 19, 2018)
47f8679 - init commit (realAngryAnalytics, Aug 19, 2018)
30a5a20 - init commit (realAngryAnalytics, Aug 19, 2018)
9bd6077 - init commit. Note that 3-CommonFunctions is still in scala and has no… (realAngryAnalytics, Aug 19, 2018)
3b756d1 - init commit (realAngryAnalytics, Aug 19, 2018)
6d54f64 - init commit (realAngryAnalytics, Aug 19, 2018)
d9595fa - init commit (realAngryAnalytics, Aug 19, 2018)
ea94467 - init commit. Note that 3-CommonFunctions is still in scala and has no… (realAngryAnalytics, Aug 19, 2018)
3964fe6 - init commit (realAngryAnalytics, Aug 19, 2018)
75ca039 - Update: (realAngryAnalytics, Aug 19, 2018)
0a03f92 - init commit (realAngryAnalytics, Aug 19, 2018)
8919ca3 - init commit (realAngryAnalytics, Aug 19, 2018)
994e4b7 - init commit (realAngryAnalytics, Aug 19, 2018)
d3ada3a - init commit (realAngryAnalytics, Aug 19, 2018)
f0db127 - init commit (realAngryAnalytics, Aug 19, 2018)
670fe90 - init commit (realAngryAnalytics, Aug 19, 2018)
94f4e11 - init commit (realAngryAnalytics, Aug 20, 2018)
1813422 - init commit (realAngryAnalytics, Aug 20, 2018)
cedc0b1 - init commit (realAngryAnalytics, Aug 20, 2018)
2b871fb - init commit (realAngryAnalytics, Aug 20, 2018)
361ed17 - init commit (realAngryAnalytics, Aug 20, 2018)
719f88e - init commit (realAngryAnalytics, Aug 20, 2018)
8351f90 - init commit (realAngryAnalytics, Aug 20, 2018)
c343558 - init commit. Note that 3-CommonFunctions is still in scala and has no… (realAngryAnalytics, Aug 20, 2018)
49e1ff3 - init commit (realAngryAnalytics, Aug 20, 2018)
4062440 - init commit (realAngryAnalytics, Aug 20, 2018)
c7cc6ff - init commit (realAngryAnalytics, Aug 20, 2018)
2af499e - init commit (realAngryAnalytics, Aug 20, 2018)
82664c0 - init commit (realAngryAnalytics, Aug 20, 2018)
0d0d761 - init commit (realAngryAnalytics, Aug 20, 2018)
819be3b - init commit (realAngryAnalytics, Aug 20, 2018)
1df51f7 - init commit (realAngryAnalytics, Aug 20, 2018)
6bea21b - init commit (realAngryAnalytics, Aug 20, 2018)
eedd63a - init commit (realAngryAnalytics, Aug 20, 2018)
6a2cfc2 - init commit (realAngryAnalytics, Aug 20, 2018)
4c39905 - removed unnecessary files from added directory (Aug 20, 2018)
ed46f25 - remove unneeded directories (Aug 20, 2018)
d447c4d - merge wip to master (Aug 20, 2018)
@@ -0,0 +1,53 @@
# Databricks notebook source
# MAGIC %md
# MAGIC # Mount blob storage
# MAGIC
# MAGIC Mounting blob storage containers in Azure Databricks allows you to access blob storage containers as if they were directories.<BR>

# COMMAND ----------

# MAGIC %md
# MAGIC ### 1. Define credentials
# MAGIC To mount blob storage, we need storage credentials: the storage account name and the storage account key.

# COMMAND ----------

storageAccount = "<<your_storage_account>>"
storageAccountConfKey = "fs.azure.account.key." + storageAccount + ".blob.core.windows.net"
storageAccountConfVal = spark.conf.get("spark.hadoop.fs.azure.account.key." + storageAccount + ".blob.core.windows.net")
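
# COMMAND ----------

# Illustrative alternative (not in the original notebook): the account key can also
# be kept in a Databricks secret scope instead of the Spark conf. The scope and key
# names below are hypothetical placeholders.
storageAccountConfVal = dbutils.secrets.get(scope="<yourSecretScope>", key="<yourStorageKeyName>")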

# COMMAND ----------

# MAGIC %md
# MAGIC ### 2. Mount blob storage

# COMMAND ----------

dbutils.fs.mount(
  source = "wasbs://nyctaxi-scratch@{}.blob.core.windows.net/".format(storageAccount),
  mount_point = "/mnt/<userid>/data/nyctaxi/scratchDir",
  extra_configs = {storageAccountConfKey: storageAccountConfVal})
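
# COMMAND ----------

# Illustrative sketch (not in the original notebook): dbutils.fs.mount raises an
# error if the mount point already exists, so a defensive pattern is to check
# dbutils.fs.mounts() before mounting. The path below is the same placeholder used
# in the cell above.
mountPoint = "/mnt/<userid>/data/nyctaxi/scratchDir"
if not any(m.mountPoint == mountPoint for m in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = "wasbs://nyctaxi-scratch@{}.blob.core.windows.net/".format(storageAccount),
    mount_point = mountPoint,
    extra_configs = {storageAccountConfKey: storageAccountConfVal})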

# COMMAND ----------

# Display directories
display(dbutils.fs.ls("/mnt/<userid>/data/nyctaxi/"))

# COMMAND ----------

# MAGIC %md
# MAGIC ### 3. Refresh mount points

# COMMAND ----------

# Refresh mounts if applicable
dbutils.fs.refreshMounts()

# COMMAND ----------

# MAGIC %md
# MAGIC ### 4. How to unmount

# COMMAND ----------

dbutils.fs.unmount("<yourMountPoint>")
@@ -0,0 +1,166 @@
// Databricks notebook source
// MAGIC
// MAGIC %md
// MAGIC # Download NYC taxi public dataset

// COMMAND ----------

// MAGIC %md
// MAGIC # DO NOT RUN THIS AT THE WORKSHOP

// COMMAND ----------

//imports
import sys.process._
import scala.sys.process.ProcessLogger

// COMMAND ----------

//================================================================================
// 1. DOWNLOAD LOOKUP DATA
//================================================================================
dbutils.fs.rm("file:/tmp/taxi+_zone_lookup.csv")
"wget -P /tmp https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv" !!
val localPath = "file:/tmp/taxi+_zone_lookup.csv"
val wasbPath = "/mnt/data/nyctaxi/stagingDir/reference-data/taxi_zone_lookup.csv"

display(dbutils.fs.ls(localPath))
dbutils.fs.cp(localPath, wasbPath)
dbutils.fs.rm(localPath)
display(dbutils.fs.ls(wasbPath))

// COMMAND ----------

//================================================================================
// 2. DOWNLOAD 2017 TRANSACTIONAL DATA
//================================================================================
val cabTypes = Seq("yellow", "green", "fhv")
for (cabType <- cabTypes) {
  for (i <- 1 to 6) {
    val fileName = cabType + "_tripdata_2017-0" + i + ".csv"
    val wasbPath = "/mnt/data/nyctaxi/stagingDir/transactional-data/year=2017/month=0" + i + "/type=" + cabType + "/"
    val wgetToExec = "wget -P /tmp https://s3.amazonaws.com/nyc-tlc/trip+data/" + fileName
    println(wgetToExec)

    wgetToExec !!

    val localPath = "file:/tmp/" + fileName
    dbutils.fs.mkdirs(wasbPath)
    dbutils.fs.cp(localPath, wasbPath)
    dbutils.fs.rm(localPath)
    display(dbutils.fs.ls(wasbPath))
  }
}

// COMMAND ----------

//================================================================================
// 3. DOWNLOAD 2014-16 TRANSACTIONAL DATA
//================================================================================
val cabTypes = Seq("yellow", "green", "fhv")

for (cabType <- cabTypes) {
  for (j <- 2014 to 2016) {
    for (i <- 1 to 12) {
      val fileName = cabType + "_tripdata_" + j + "-" + "%02d".format(i) + ".csv"
      println(fileName)
      val wasbPath = "/mnt/data/nyctaxi/stagingDir/transactional-data/year=" + j + "/month=" + "%02d".format(i) + "/type=" + cabType + "/"
      val wgetToExec = "wget -P /tmp https://s3.amazonaws.com/nyc-tlc/trip+data/" + fileName
      println(wgetToExec)

      wgetToExec !!

      val localPath = "file:/tmp/" + fileName
      dbutils.fs.mkdirs(wasbPath)
      dbutils.fs.cp(localPath, wasbPath)
      dbutils.fs.rm(localPath)
      display(dbutils.fs.ls(wasbPath))
    }
  }
}

// COMMAND ----------

//================================================================================
// 4. DOWNLOAD 2013 TRANSACTIONAL DATA - JAN to JULY
//================================================================================

val j = "2013"
val cabType = "yellow"
for (i <- 1 to 7)
{
val fileName = cabType + "_tripdata_" + j + "-" + "%02d".format(i) + ".csv"
println(fileName)
val wasbPath="/mnt/data/nyctaxi/stagingDir/transactional-data/year=" + j + "/month=" + "%02d".format(i) + "/type=" + cabType + "/"
val wgetToExec = "wget -P /tmp https://s3.amazonaws.com/nyc-tlc/trip+data/" + fileName
println(wgetToExec)

wgetToExec !!

val localPath="file:/tmp/" + fileName
dbutils.fs.mkdirs(wasbPath)
dbutils.fs.cp(localPath, wasbPath)
dbutils.fs.rm(localPath)
display(dbutils.fs.ls(wasbPath))
}


// COMMAND ----------

//================================================================================
// 4b. DOWNLOAD 2013 TRANSACTIONAL DATA - AUG to DEC
//================================================================================

val j = "2013"
val cabTypes = Seq("yellow", "green")
for (cabType <- cabTypes) {
for (i <- 8 to 12)
{
val fileName = cabType + "_tripdata_" + j + "-" + "%02d".format(i) + ".csv"
println(fileName)
val wasbPath="/mnt/data/nyctaxi/stagingDir/transactional-data/year=" + j + "/month=" + "%02d".format(i) + "/type=" + cabType + "/"
val wgetToExec = "wget -P /tmp https://s3.amazonaws.com/nyc-tlc/trip+data/" + fileName
println(wgetToExec)

wgetToExec !!

val localPath="file:/tmp/" + fileName
dbutils.fs.mkdirs(wasbPath)
dbutils.fs.cp(localPath, wasbPath)
dbutils.fs.rm(localPath)
display(dbutils.fs.ls(wasbPath))
}
}

// COMMAND ----------

//================================================================================
// 5. DOWNLOAD 2009-12 TRANSACTIONAL DATA
//================================================================================

val cabType = "yellow"
for (j <- 2009 to 2012)
{
for (i <- 1 to 12)
{
val fileName = cabType + "_tripdata_" + j + "-" + "%02d".format(i) + ".csv"
println(fileName)
val wasbPath="/mnt/data/nyctaxi/stagingDir/transactional-data/year=" + j + "/month=" + "%02d".format(i) + "/type=" + cabType + "/"
val wgetToExec = "wget -P /tmp https://s3.amazonaws.com/nyc-tlc/trip+data/" + fileName
println(wgetToExec)

wgetToExec !!

val localPath="file:/tmp/" + fileName
dbutils.fs.mkdirs(wasbPath)
dbutils.fs.cp(localPath, wasbPath)
dbutils.fs.rm(localPath)
display(dbutils.fs.ls(wasbPath))
}
}

// COMMAND ----------

@@ -0,0 +1,117 @@
# Databricks notebook source
# MAGIC %md
# MAGIC # What's in this exercise?
# MAGIC Learn how to create Hive tables on top of files in DBFS and query them.<BR>
# MAGIC Inserting, overwriting, etc. are covered later in the workshop.

# COMMAND ----------

# MAGIC %md
# MAGIC ### 1. Create some sample data

# COMMAND ----------

# MAGIC %sh
# MAGIC
# MAGIC # Create an empty text file in /tmp
# MAGIC cd /tmp
# MAGIC rm -f us_states.csv
# MAGIC touch us_states.csv
# MAGIC
# MAGIC # Add some content to the file
# MAGIC echo "IL,Illinois" >> /tmp/us_states.csv
# MAGIC echo "IN,Indiana" >> /tmp/us_states.csv
# MAGIC echo "MN,Minnesota" >> /tmp/us_states.csv
# MAGIC echo "WI,Wisconsin" >> /tmp/us_states.csv

# COMMAND ----------

# MAGIC %sh
# MAGIC cat /tmp/us_states.csv

# COMMAND ----------

# MAGIC %md
# MAGIC ### 2. Save data to DBFS

# COMMAND ----------

# MAGIC %fs
# MAGIC ls /mnt/<userid>/data/nyctaxi/scratchDir/<userid>

# COMMAND ----------

# MAGIC %fs
# MAGIC mkdirs /mnt/<userid>/data/nyctaxi/scratchDir/<userid>/testDir/hiveTableTest

# COMMAND ----------

# MAGIC %fs
# MAGIC ls /mnt/<userid>/data/nyctaxi/scratchDir/<userid>/testDir/

# COMMAND ----------

if dbutils.fs.cp("file:/tmp/us_states.csv", "/mnt/<userid>/data/nyctaxi/scratchDir/<userid>/testDir/hiveTableTest"):
  dbutils.fs.rm("file:/tmp/us_states.csv", recurse=True)


# COMMAND ----------

dbutils.fs.ls("/mnt/<userid>/data/nyctaxi/scratchDir/<userid>/testDir/hiveTableTest/")

# COMMAND ----------

# MAGIC %fs
# MAGIC
# MAGIC ls /tmp

# COMMAND ----------

# MAGIC %md
# MAGIC ### 3. Define a Hive table for US states

# COMMAND ----------

display(spark.catalog.listDatabases())

# COMMAND ----------

# MAGIC %sql
# MAGIC CREATE DATABASE IF NOT EXISTS <userid>_demo_db;

# COMMAND ----------

# MAGIC %sql
# MAGIC use <userid>_demo_db;
# MAGIC
# MAGIC drop table if exists us_states;
# MAGIC create external table if not exists us_states(
# MAGIC state_cd string,
# MAGIC state_nm string
# MAGIC )
# MAGIC row format delimited
# MAGIC fields terminated by ','
# MAGIC location "/mnt/<userid>/data/nyctaxi/scratchDir/<userid>/testDir/hiveTableTest/";

# COMMAND ----------

spark.catalog.setCurrentDatabase("<userid>_demo_db")
spark.catalog.listTables()
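
# COMMAND ----------

# Illustrative sketch (not in the original notebook): the table can also be read
# into a DataFrame with spark.table and filtered with the DataFrame API.
usStatesDF = spark.table("<userid>_demo_db.us_states")
display(usStatesDF.filter(usStatesDF.state_cd == "IL"))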

# COMMAND ----------

# MAGIC %sql
# MAGIC use <userid>_demo_db;
# MAGIC
# MAGIC select * from us_states;

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC # References:
# MAGIC https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html
@@ -0,0 +1,40 @@
# Databricks notebook source
# MAGIC %md
# MAGIC # What's in this exercise?
# MAGIC 101 on working with Azure SQL database

# COMMAND ----------

# MAGIC %md
# MAGIC # 1.0. Azure SQL Database
# MAGIC
# MAGIC Databricks clusters on version 3.5 and above come with JDBC drivers for Azure SQL Database, MySQL, and PostgreSQL preinstalled.
# MAGIC In this section, we will learn to work with Azure SQL Database.
# MAGIC
# MAGIC The following is covered in this section:
# MAGIC 1. Querying a remote Azure SQL database table
# MAGIC 2. Reading from & writing to an Azure SQL database is covered as part of the lab

# COMMAND ----------

# MAGIC %sql
# MAGIC
# MAGIC CREATE DATABASE IF NOT EXISTS <userid>_demo_db;
# MAGIC USE <userid>_demo_db;
# MAGIC
# MAGIC -- This table lives in Azure SQL Database; we are merely defining a schema over it
# MAGIC DROP TABLE IF EXISTS us_states;
# MAGIC CREATE TABLE us_states
# MAGIC USING org.apache.spark.sql.jdbc
# MAGIC OPTIONS (
# MAGIC url 'jdbc:sqlserver://demodbsrvr.database.windows.net:1433;database=demodb',
# MAGIC dbtable 'us_states',
# MAGIC user 'demodbadmin',
# MAGIC password '<AskYourInstructor>'
# MAGIC )
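
# COMMAND ----------

# Illustrative sketch (not in the original notebook): the same remote table can be
# read directly with the DataFrame JDBC reader; writes work analogously with
# DataFrame.write.jdbc. The URL and credentials are the workshop placeholders from
# the cell above.
jdbcUrl = "jdbc:sqlserver://demodbsrvr.database.windows.net:1433;database=demodb"
connectionProperties = {"user": "demodbadmin", "password": "<AskYourInstructor>"}
usStatesDF = spark.read.jdbc(url=jdbcUrl, table="us_states", properties=connectionProperties)
display(usStatesDF)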

# COMMAND ----------

# MAGIC %sql
# MAGIC
# MAGIC select * from <userid>_demo_db.us_states where stateCode='IL'