End-to-end AWS data pipeline: ingesting, transforming, and visualizing 173+ million NYC taxi trips, with on-demand historical queries back to 2009.
The complete AWS pipeline: NYC TLC data is ingested via Lambda into S3 raw storage, transformed by Apache Spark on EC2, and served through a Streamlit dashboard. RDS stores pipeline metadata, and GitHub Actions CI/CD automates deployments.
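As a rough illustration of the ingestion step, the sketch below shows how a Lambda handler might derive the source file URL and the raw S3 object key for one month of data. The base URL, the `raw/` key layout, and both function names are assumptions made for this example, not details confirmed by the project; the actual download and upload calls are left out.

```python
# Assumed public URL prefix for TLC trip-record Parquet files; verify before use.
TLC_BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

VALID_TAXI_TYPES = ("yellow", "green")


def _file_name(taxi_type: str, year: int, month: int) -> str:
    """Validate the inputs and build the monthly Parquet file name."""
    if taxi_type not in VALID_TAXI_TYPES:
        raise ValueError(f"unsupported taxi type: {taxi_type!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"


def source_url(taxi_type: str, year: int, month: int) -> str:
    """URL the ingestion step would download from (assumed pattern)."""
    return f"{TLC_BASE_URL}/{_file_name(taxi_type, year, month)}"


def raw_s3_key(taxi_type: str, year: int, month: int) -> str:
    """Hypothetical partitioned key layout for the S3 raw bucket."""
    name = _file_name(taxi_type, year, month)
    return f"raw/{taxi_type}/year={year}/month={month:02d}/{name}"
```

Partitioning keys by taxi type, year, and month is one common choice because it lets a downstream Spark job read a single month without listing the whole bucket.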
The main entry point of the dashboard. Displays key project metrics: 173M+ processed records, 5 years of pre-loaded data, and the complete tech stack powering the pipeline. Dark theme with Space Mono typography.
Deep dive into any month from 2021–2025. KPI cards for total trips, avg duration, distance, revenue, and passengers. Hourly distribution, Yellow vs Green breakdown, payment methods, and distance-revenue scatter.
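The monthly KPIs and the hourly distribution can be sketched in plain Python as follows. The `Trip` record, its field names, and the choice of averages versus totals for each card are assumptions for illustration; the dashboard itself may compute these differently, for example in Spark or pandas.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Trip:
    # Hypothetical minimal trip record; field names are illustrative.
    pickup: datetime
    dropoff: datetime
    distance_miles: float
    total_amount: float
    passengers: int


def monthly_kpis(trips: list[Trip]) -> dict[str, float]:
    """Aggregate the headline numbers for one month of trips."""
    n = len(trips)
    if n == 0:
        return {"total_trips": 0}
    total_minutes = sum((t.dropoff - t.pickup).total_seconds() for t in trips) / 60
    return {
        "total_trips": n,
        "avg_duration_min": total_minutes / n,
        "avg_distance_miles": sum(t.distance_miles for t in trips) / n,
        "total_revenue": sum(t.total_amount for t in trips),
        "total_passengers": sum(t.passengers for t in trips),
    }


def hourly_distribution(trips: list[Trip]) -> list[int]:
    """Trip counts per pickup hour, index 0 through 23."""
    counts = Counter(t.pickup.hour for t in trips)
    return [counts.get(hour, 0) for hour in range(24)]
```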
Scatter map of NYC taxi zones on Mapbox dark tiles. Four metrics: Total Pickups, Total Dropoffs, Avg Revenue, Avg Distance. Zones are sized and colored by activity, with a ranked bar chart alongside.
Comparative analysis 2021–2025. Monthly trip volume trends, avg revenue and distance evolution, total trips per year. Highlights post-pandemic recovery and seasonal patterns.
Query any month 2009–2020 in real time. Quick View with pandas (~40s, 200K sample) or Full Pipeline with Spark (~4-9 min, full ETL). Auto schema detection for 16+ years of evolving data formats.
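One way automatic schema detection could work is to map each year's column names onto a single canonical set. The sketch below is a minimal version of that idea and does not reproduce the project's actual code: the alias table, the canonical names, and both function names are assumptions, and the aliases listed reflect column names believed to appear in older and newer TLC files.

```python
# Canonical field -> known aliases, compared case-insensitively.
# The alias list is an assumption and is likely incomplete.
CANONICAL_ALIASES = {
    "pickup_datetime": (
        "trip_pickup_datetime",
        "pickup_datetime",
        "tpep_pickup_datetime",
        "lpep_pickup_datetime",
    ),
    "dropoff_datetime": (
        "trip_dropoff_datetime",
        "dropoff_datetime",
        "tpep_dropoff_datetime",
        "lpep_dropoff_datetime",
    ),
    "passenger_count": ("passenger_count",),
    "trip_distance": ("trip_distance",),
    "fare_amount": ("fare_amt", "fare_amount"),
    "total_amount": ("total_amt", "total_amount"),
}


def detect_schema(columns) -> dict[str, str]:
    """Return {original column name: canonical name} for recognized columns."""
    lowered = {str(c).strip().lower(): c for c in columns}
    mapping = {}
    for canonical, aliases in CANONICAL_ALIASES.items():
        for alias in aliases:
            if alias in lowered:
                mapping[lowered[alias]] = canonical
                break  # first matching alias wins
    return mapping


def missing_fields(columns) -> list[str]:
    """Canonical fields that no column in the file could be mapped to."""
    found = set(detect_schema(columns).values())
    return sorted(set(CANONICAL_ALIASES) - found)
```

With pandas, the resulting mapping could be applied as `df.rename(columns=detect_schema(df.columns))`, and `missing_fields` gives an early check before a month is sent to the Quick View or the full Spark run.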