NYC Taxi Intelligence

End-to-end AWS data pipeline: ingesting, transforming, and visualizing 173+ million NYC taxi trips, with on-demand historical queries back to 2009.

173M+ Records
5 Years Pre-loaded
16+ Years On-demand
2 Taxi Types
AWS Lambda · AWS S3 · AWS EC2 · Apache Spark · Docker · Streamlit · Python
01

Architecture

🏗️

System Architecture

The complete AWS pipeline: NYC TLC data is ingested via Lambda into S3 raw storage, transformed by Apache Spark on EC2, and served through a Streamlit dashboard. RDS stores pipeline metadata, and GitHub Actions CI/CD automates deployments.

Architecture Diagram
AWS Cloud: Lambda → S3 → Spark (EC2 t2.medium) · GitHub Actions · RDS Metadata
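
The ingestion step at the front of this pipeline can be sketched as two pure functions: one that builds the download URL for a month of trips and one that builds the S3 key for the raw zone. This is only a sketch under stated assumptions: the base URL is assumed to be the public prefix for TLC Parquet files, and the key layout and the names `source_url`, `raw_key`, and `plan_ingest` are hypothetical, not taken from the project. No network or AWS calls are made here; a real Lambda handler would pass the resulting plan to its download and upload code.

```python
from datetime import date

# Assumed public download prefix for TLC trip-record Parquet files.
TLC_BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
TAXI_TYPES = ("yellow", "green")  # the two taxi types the pipeline covers


def source_url(taxi_type: str, year: int, month: int) -> str:
    """Build the download URL for one month of trips."""
    if taxi_type not in TAXI_TYPES:
        raise ValueError(f"unknown taxi type: {taxi_type!r}")
    date(year, month, 1)  # raises ValueError on an invalid year/month
    return f"{TLC_BASE_URL}/{taxi_type}_tripdata_{year}-{month:02d}.parquet"


def raw_key(taxi_type: str, year: int, month: int) -> str:
    """Hypothetical S3 key layout for the raw zone, partitioned by date."""
    return (
        f"raw/{taxi_type}/year={year}/month={month:02d}/"
        f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"
    )


def plan_ingest(event: dict) -> dict:
    """Turn a Lambda-style event into a download/upload plan (no I/O)."""
    taxi_type = event["taxi_type"]
    year, month = int(event["year"]), int(event["month"])
    return {
        "url": source_url(taxi_type, year, month),  # validates the inputs
        "key": raw_key(taxi_type, year, month),
    }
```

Keeping URL and key construction free of side effects makes the naming scheme easy to check before any data is moved.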
02

Home

🏠

Landing Page

The main entry point of the dashboard. Displays key project metrics (173M+ processed records, 5 years of pre-loaded data) and the complete tech stack powering the pipeline. Dark theme with Space Mono typography.

173M+ Records · Live S3 · Dark Theme
03

Monthly Dashboard

📅

Monthly Analytics

Deep dive into any month from 2021–2025. KPI cards for total trips, avg duration, distance, revenue, and passengers. Hourly distribution, Yellow vs Green breakdown, payment methods, and distance-revenue scatter.

KPI Cards · Plotly Charts · Cached Data
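
As an illustration of what sits behind the KPI cards, here is a minimal sketch of the monthly aggregation in plain Python. The record fields (`pickup`, `dropoff`, `distance`, `total_amount`, `passengers`) and the function name `monthly_kpis` are hypothetical, and the sketch assumes the cards show per-trip averages for duration, distance, revenue, and passengers; the dashboard itself may aggregate differently.

```python
from collections import Counter
from datetime import datetime


def monthly_kpis(trips: list) -> dict:
    """Aggregate KPI-card values from a list of trip records (dicts)."""
    n = len(trips)
    if n == 0:
        return {"total_trips": 0}
    # Trip duration in minutes, from pickup and dropoff timestamps.
    minutes = [
        (t["dropoff"] - t["pickup"]).total_seconds() / 60 for t in trips
    ]
    return {
        "total_trips": n,
        "avg_duration_min": sum(minutes) / n,
        "avg_distance": sum(t["distance"] for t in trips) / n,
        "avg_revenue": sum(t["total_amount"] for t in trips) / n,
        "avg_passengers": sum(t["passengers"] for t in trips) / n,
        # Hourly distribution: number of pickups per hour of day.
        "trips_by_hour": dict(Counter(t["pickup"].hour for t in trips)),
    }


# Two made-up trips, for illustration only.
sample_trips = [
    {"pickup": datetime(2023, 1, 1, 8, 0), "dropoff": datetime(2023, 1, 1, 8, 10),
     "distance": 2.0, "total_amount": 12.0, "passengers": 1},
    {"pickup": datetime(2023, 1, 1, 8, 30), "dropoff": datetime(2023, 1, 1, 9, 0),
     "distance": 4.0, "total_amount": 20.0, "passengers": 3},
]
kpis = monthly_kpis(sample_trips)
```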
04

Zone Map

🗺️

Interactive NYC Map

Scatter map of NYC taxi zones on Mapbox dark tiles. Four metrics: Total Pickups, Total Dropoffs, Avg Revenue, Avg Distance. Zones sized and colored by activity, with ranked bar chart alongside.

Mapbox · 4 Metrics · 140+ Zones
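
The four map metrics and the ranked bar chart can be sketched as a per-zone aggregation followed by a sort. This is a rough sketch, not the dashboard's code: the field names (`pickup_zone`, `dropoff_zone`, `total_amount`, `distance`) and the helpers `zone_metric` and `ranked` are hypothetical, and it assumes the two averages are grouped by pickup zone.

```python
from collections import defaultdict

METRICS = ("total_pickups", "total_dropoffs", "avg_revenue", "avg_distance")


def zone_metric(trips: list, metric: str) -> dict:
    """Return {zone_id: value} for one of the four map metrics."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    # Dropoff counts group by dropoff zone; everything else by pickup zone.
    zone_field = "dropoff_zone" if metric == "total_dropoffs" else "pickup_zone"
    # The two averages need a value column; the two totals only count rows.
    value_field = {"avg_revenue": "total_amount", "avg_distance": "distance"}.get(metric)

    counts = defaultdict(int)
    sums = defaultdict(float)
    for trip in trips:
        zone = trip[zone_field]
        counts[zone] += 1
        if value_field is not None:
            sums[zone] += trip[value_field]

    if value_field is None:
        return dict(counts)
    return {zone: sums[zone] / counts[zone] for zone in counts}


def ranked(values: dict, top_n: int = 10) -> list:
    """Zones sorted by metric value, highest first (for the bar chart)."""
    return sorted(values.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

The same `{zone_id: value}` mapping can drive both the marker size and color on the map and the ordering of the bar chart.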
05

Annual Comparison

📊

Year-over-Year Trends

Comparative analysis 2021–2025. Monthly trip volume trends, avg revenue and distance evolution, total trips per year. Highlights post-pandemic recovery and seasonal patterns.

5 Years · Trend Lines · 60 Months
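
One way to express the year-over-year comparison is to roll monthly trip counts up to yearly totals and then compute the percent change against the previous year. The sketch below assumes monthly counts keyed by `(year, month)`; the function names `yearly_totals` and `yoy_change` are hypothetical, and the numbers in the usage example are made up.

```python
def yearly_totals(monthly_trips: dict) -> dict:
    """Sum {(year, month): trips} into {year: trips}."""
    totals = {}
    for (year, _month), trips in monthly_trips.items():
        totals[year] = totals.get(year, 0) + trips
    return totals


def yoy_change(totals: dict) -> dict:
    """Percent change in total trips versus the previous year."""
    changes = {}
    for year in sorted(totals):
        previous = totals.get(year - 1)
        if previous:  # skip the first year and any zero baseline
            changes[year] = (totals[year] - previous) / previous * 100
    return changes


# Made-up counts, for illustration only.
example = {(2021, 1): 100, (2021, 2): 100, (2022, 1): 150, (2022, 2): 150}
example_totals = yearly_totals(example)
example_yoy = yoy_change(example_totals)
```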
06

Historical Query

🔍

On-Demand Extraction

Query any month 2009–2020 in real time. Quick View with pandas (~40s, 200K sample) or Full Pipeline with Spark (~4-9 min, full ETL). Auto schema detection for 16+ years of evolving data formats.

Spark ETL · Pandas · 2009–2020 · Auto Schema
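
Auto schema detection could work roughly as follows: inspect the column names of a file and map whichever historical variant is present onto one canonical name. This is a sketch under assumptions rather than the project's implementation. The alias table reflects column names as they are believed to appear in different eras of the TLC files (for example `Trip_Pickup_DateTime` in 2009 and `tpep_pickup_datetime` later), and `CANONICAL_ALIASES` and `detect_schema` are hypothetical names.

```python
# Canonical column name -> lower-cased aliases seen across the years (assumed).
CANONICAL_ALIASES = {
    "pickup_datetime": (
        "trip_pickup_datetime", "pickup_datetime",
        "tpep_pickup_datetime", "lpep_pickup_datetime",
    ),
    "dropoff_datetime": (
        "trip_dropoff_datetime", "dropoff_datetime",
        "tpep_dropoff_datetime", "lpep_dropoff_datetime",
    ),
    "passenger_count": ("passenger_count",),
    "trip_distance": ("trip_distance",),
    "total_amount": ("total_amt", "total_amount"),
}


def detect_schema(columns: list) -> dict:
    """Return a rename map {actual column: canonical column}.

    Matching ignores case and surrounding whitespace. Raises KeyError
    if any canonical column has no match in the file.
    """
    by_lower = {col.strip().lower(): col for col in columns}
    rename, missing = {}, []
    for canonical, aliases in CANONICAL_ALIASES.items():
        found = next((by_lower[a] for a in aliases if a in by_lower), None)
        if found is None:
            missing.append(canonical)
        else:
            rename[found] = canonical
    if missing:
        raise KeyError(f"no column found for: {', '.join(missing)}")
    return rename
```

A rename map of this shape can be applied in either path, for example with `DataFrame.rename(columns=...)` in pandas for the Quick View, so that downstream code only sees the canonical names.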