The goal of this project is to perform data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio.
- Extracted the data and loaded it into Google Cloud Storage for further processing.
- Transformed and modelled the data using fact and dimensional data modelling concepts using Python on Jupyter Notebook.
- Orchestrated the ETL process/data pipeline on Mage and loaded the transformed data into Google BigQuery.
- Developed a dashboard on Looker Studio.
✅ LEARNING LESSONS
-
Initially, I created a cloud instance with 1 CPU and 4GB of memory. However, as I started working on Mage, I experienced significant lag. To overcome this, I created a new instance with 4 CPUs and 16GB of memory instead and performance improved. 👍🏻
-
Mage would occasionally shut down unexpectedly. To resolve this, I restarted SSH which prompted me to switch to a different port. It was only after several attempts that I realised that I had to create a new firewall rule to allow access to the new port. 😅
-
If all else fails, verify the ephemeral URL as mine changed unexpectedly.
✨ INTERESTING INSIGHTS
- Standard rate code generates the highest revenue indicating its popularity among users.
- On average, trips to Newark result in the highest tip at $11.
- Credit cards are the preferred mode of payment accounting for a significant 75% of all transactions, followed by cash.
- One-person trips make up the majority, comprising 66% of all recorded journeys.
- The bustling area around Jackson Ave, near the Queens Plaza station in Long Island City stands out as the top pickup location.
- The highest fare occurs at 2pm suggesting a peak period of travel. In contrast, the lowest fare is observed at 7am, likely due to off-peak hours.
- Python
- Google Cloud Storage
- Mage - modern data pipeline tool
- Google Cloud Compute VM instance
- BigQuery
- Looker Studio
TLC Trip Record Data Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.