- Description of the Problem
- Objective
- Technologies
- Data Architecture
- Data Description
- Draw.io
- Jupyter Notebook
- Mage
- Google BigQuery
- Google Data Studio
In the arena of urban transportation planning and ride-hailing services, data analysts frequently utilize historical ride data to develop predictive models. These models are used to forecast future demand, optimize route planning, and inform strategic business decisions.
Take, for instance, an analyst at Uber. The predictive model they work with employs a time-series analysis approach, utilizing historical data such as ride timestamps, origin-destination pairs, ride duration, and wait times, among other relevant indicators.
However, one significant challenge with this approach arises due to the dynamic and unpredictable nature of the urban transportation environment. Unforeseen events like sudden changes in weather conditions, major city events, traffic disruptions, and changes in local regulations can all significantly impact ride demand and availability.
Using the Uber datasets as an example, user behavior and ride demand patterns can differ significantly based on the aforementioned factors. A prediction algorithm that relies heavily on historical data might generate inaccurate forecasts, leading to inefficient resource allocation, suboptimal service levels, and potential revenue loss.
Another complication could arise when these datasets are used across different geographical locations. Uber operates in numerous countries, and the ride data is influenced by local context, such as traffic patterns, cultural practices, and regional regulations. If an analyst does not factor in these local variations while analyzing the datasets, it could lead to incorrect conclusions and decisions.
These scenarios underline the importance of combining historical data analysis with a keen understanding of current local context, relevant events, and other crucial factors to make informed and prudent business decisions.
The datasets containing historical ride data from Uber are invaluable for a multitude of applications. They can be used by urban planners, transportation analysts, and economists for in-depth analysis of urban mobility patterns, demand forecasting, and traffic management.
For Uber and other ride-hailing services, these datasets can serve to optimize service levels, pricing strategies, and resource allocation. By analyzing past ride data, the company can anticipate demand hotspots, identify peak service times, and uncover trends and patterns that can guide strategic and operational decisions.
Furthermore, the datasets can be used for academic and educational purposes, allowing students and researchers to understand urban mobility patterns, study the impact of events on ride demand, and analyze the effects of ride-hailing services on urban transportation systems.
Each Uber dataset, typically structured in .csv files, includes fields like ride timestamps, origin and destination coordinates, ride duration, and wait times. However, despite their immense utility, these datasets have certain limitations. They represent historical patterns, do not account for sudden external events or local context variations, and rely on the accuracy of the recorded data. It's essential to consider these limitations when using this data for predictive modeling or decision-making purposes.
Lastly, it's crucial to use this dataset responsibly, maintaining user anonymity, ensuring data privacy, and complying with all applicable laws and regulations.
The choice of technologies varies from case to case. Several options exist for any given scenario; for this project, I am using the stack below because it is the most straightforward.
- Jupyter Notebook
- Mage
- Google VM Instances
- GCS (Google Cloud Storage)
- Google BigQuery
- Google Data Studio
The data contains the full trip records of the Uber/NYC taxi dataset. It consists of several columns that will be split out into separate tables.
The data was collected via Kaggle; the original source is the NYC TLC Trip Record Data page: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
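As a quick sanity check before modeling the tables, the raw file can be inspected with pandas. This is a minimal sketch: the file name `uber_data.csv` is an assumption, and the columns follow the NYC TLC yellow taxi schema, so adjust both to match the file you actually downloaded.

```python
import pandas as pd

# Minimal sketch: the file name is an assumption; adjust to the downloaded file.
df = pd.read_csv("uber_data.csv")

print(df.shape)    # number of trips and number of columns
print(df.dtypes)   # column names and their data types
print(df.head())   # first few trip records
```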
- Design the database by building a main table (fact table) and branch tables (dimension tables); see the schema sketch after this list.
- The fact table holds the numeric measures, while the dimension tables hold the context for those measures.
- Add the id columns (primary keys and foreign keys).
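As an illustration of this star-schema design, here is a minimal sketch in Python. The table and column names are assumptions based on a typical NYC taxi layout, not the exact schema drawn for this project.

```python
# Illustrative star schema: one fact table of measures plus the dimension
# tables that give those measures their context. Names are assumptions.
schema = {
    "fact_table": [
        "trip_id",             # primary key
        "datetime_id",         # foreign keys pointing at the dimension tables
        "pickup_location_id",
        "dropoff_location_id",
        "rate_code_id",
        "payment_type_id",
        "passenger_count",     # numeric measures
        "trip_distance",
        "fare_amount",
        "total_amount",
    ],
    "datetime_dim": ["datetime_id", "pickup_datetime", "dropoff_datetime"],
    "pickup_location_dim": ["pickup_location_id", "pickup_latitude", "pickup_longitude"],
    "dropoff_location_dim": ["dropoff_location_id", "dropoff_latitude", "dropoff_longitude"],
    "rate_code_dim": ["rate_code_id", "rate_code_name"],
    "payment_type_dim": ["payment_type_id", "payment_type_name"],
}

for table, columns in schema.items():
    print(table, "->", ", ".join(columns))
```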
- Jupyter Notebook can be accessed from a VM instance (provided in GCP) or from your local machine.
- Create the GitHub repo and run `git clone` in a new directory.
- Import the necessary packages.
- Download the data from the website using `!wget`.
- Create each table by duplicating the relevant columns from the initial big table.
- Generate the id (the table's primary key) from the index (it will generate numbers starting from 1).
- Arrange the column order of each table.
- Do the same for all tables.
- Lastly, merge all the tables and arrange the column order; a sketch of these steps is shown below.
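A minimal sketch of these notebook steps, shown for one dimension table. The file name and column names follow the NYC TLC schema and are assumptions; adapt them to the actual dataset.

```python
import pandas as pd

df = pd.read_csv("uber_data.csv")          # file name is an assumption
df = df.drop_duplicates().reset_index(drop=True)

# Build a dimension table by duplicating columns from the initial big table.
datetime_dim = df[["tpep_pickup_datetime", "tpep_dropoff_datetime"]].copy()

# Generate the primary key from the index (numbered from 1).
datetime_dim["datetime_id"] = datetime_dim.index + 1

# Arrange the column order so the primary key comes first.
datetime_dim = datetime_dim[["datetime_id", "tpep_pickup_datetime", "tpep_dropoff_datetime"]]

# Repeat for the other dimension tables, then build the fact table that keeps
# only the numeric measures plus the foreign keys to each dimension table.
fact_table = pd.DataFrame({
    "trip_id": df.index + 1,
    "datetime_id": datetime_dim["datetime_id"],
    "trip_distance": df["trip_distance"],
    "fare_amount": df["fare_amount"],
    "total_amount": df["total_amount"],
})
```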
In this project, we will use Mage, a modern data pipeline platform. It lets us do many things, such as load, transform, and export data.
- First, open your virtual machine (in this case, I am using a Google Cloud VM instance).
- Set up the required environment.
- Install the Mage tool using the command from the Mage website.
- Start the project by typing `mage start`.
- In your Google Cloud account, add a firewall rule to allow traffic on port 6789, which we will use for the Mage web UI (set this in the Firewall section of GCP).
- Copy the external IP address of your VM instance and append the port (`:6789`).
- The Mage web page will then open.
- Next, we create the data pipeline in three steps (extract, transform, load); a sketch of the block structure is shown after this list.
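As a rough sketch of what the pipeline blocks look like, here are a data-loader and a data-exporter block modeled on Mage's default block templates. The dataset URL, table id, and config profile are placeholders, and the decorator/helper import paths may vary slightly between Mage versions.

```python
import io
from os import path

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


# Extract block: pull the raw CSV from a URL.
@data_loader
def load_data(*args, **kwargs):
    url = "https://example.com/uber_data.csv"  # placeholder, not the real URL
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text))


# Export block: write a DataFrame to BigQuery via Mage's BigQuery connector,
# using the credentials configured in io_config.yaml.
@data_exporter
def export_to_bigquery(df: pd.DataFrame, **kwargs) -> None:
    from mage_ai.io.bigquery import BigQuery
    from mage_ai.io.config import ConfigFileLoader
    from mage_ai.settings.repo import get_repo_path

    table_id = "your-project.your_dataset.your_table"  # placeholder
    config_path = path.join(get_repo_path(), "io_config.yaml")

    BigQuery.with_config(ConfigFileLoader(config_path, "default")).export(
        df,
        table_id,
        if_exists="replace",
    )
```

The transform block that sits between these two simply reuses the fact/dimension logic from the Jupyter Notebook step above.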
The BigQuery platform can be accessed in GCP. Since the load step in the previous stage already created the tables in BigQuery, we can directly write a SQL command to create one table to be used for data visualization. This table contains all the information from the individual tables, joined together on their keys (we use JOIN clauses to connect them).
- Join the fact table with the dimension tables.
- Select all the columns and merge them into one table; a hedged query sketch follows.
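Here is a sketch of that query, submitted through the BigQuery Python client. The project, dataset, table, and column names below are placeholders, not the project's actual identifiers, and only one dimension table is joined to keep the example short.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials

query = """
CREATE OR REPLACE TABLE `my-project.uber_dataset.tbl_analytics` AS
SELECT
    f.trip_id,
    d.tpep_pickup_datetime,
    d.tpep_dropoff_datetime,
    f.trip_distance,
    f.fare_amount,
    f.total_amount
FROM `my-project.uber_dataset.fact_table` AS f
JOIN `my-project.uber_dataset.datetime_dim` AS d
  ON f.datetime_id = d.datetime_id
"""

client.query(query).result()  # wait for the table to be created
```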
It's pretty simple: connect your BigQuery project to Google Data Studio and use the created table for visualization. Build a dashboard that, hopefully, answers all the problem-solving questions clearly.
This is my dashboard:
From these two data visualizations, we can extract a lot of information:
- Number of service usages in a given location
- The volume of those services in a given time interval
- The most frequently used services overall
- etc.