Methodology

How the AQI system thinks, trains, and stays alive.

This page documents the system's architecture, data pipeline, and modeling methodology. It covers the full machine-learning workflow, from data collection through training and evaluation to served predictions.

Workflow

The project flow from raw data to dashboard.

1. Hourly feature pipeline

GitHub Actions runs the feature workflow every hour. It fetches Islamabad weather and pollutant data, builds lag/rolling/time features, deduplicates records, and writes them into MongoDB Atlas.
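The feature-building step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name build_features, the record fields (timestamp, aqi, pm25), the lag offsets, and the window size are all assumptions made for the example. It also assumes the hourly records have no gaps, so that "one row back" means "one hour back".

```python
from datetime import datetime, timedelta

def build_features(records, lags=(1, 2, 3), window=3):
    """Deduplicate hourly records, then add time, lag, delta, and rolling features.

    Hypothetical helper: field names and parameters are illustrative assumptions.
    """
    # Deduplicate: keep the last record seen for each timestamp, then sort by time.
    unique = {r["timestamp"]: r for r in records}
    ordered = [unique[ts] for ts in sorted(unique)]

    rows = []
    for i, rec in enumerate(ordered):
        row = dict(rec)
        # Time features.
        row["day"] = rec["timestamp"].day
        row["month"] = rec["timestamp"].month
        # Lag features look only at earlier rows (None until enough history exists).
        for k in lags:
            row[f"aqi_lag_{k}"] = ordered[i - k]["aqi"] if i >= k else None
        # AQI delta: change relative to the previous row.
        row["aqi_delta"] = rec["aqi"] - ordered[i - 1]["aqi"] if i >= 1 else None
        # Rolling mean over the previous `window` rows, excluding the current one.
        past = [r["pm25"] for r in ordered[max(0, i - window):i]]
        row["pm25_roll_mean"] = sum(past) / len(past) if len(past) == window else None
        rows.append(row)
    return rows

# Illustrative input: five hourly records plus one duplicate.
start = datetime(2024, 1, 1, 0)
raw = [
    {"timestamp": start + timedelta(hours=h), "aqi": 100 + 10 * h, "pm25": 30.0 + h}
    for h in range(5)
]
raw.append(dict(raw[2]))  # duplicate of the third record
features = build_features(raw)
```

Excluding the current row from the rolling mean is a design choice in this sketch; it keeps every derived feature strictly historical. Whether the real pipeline does the same is not stated here.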

2. Daily training pipeline

The training workflow runs daily with a backup trigger. It reads historical features from the cloud feature store, trains multiple models, evaluates them with time-aware validation, and stores metrics for every horizon.

3. Champion selection

Models compete separately for Day +1, Day +2, and Day +3. The dashboard also exposes an overall ranking so users can understand both horizon-level and global performance.
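One way to picture per-horizon competition alongside an overall ranking is the sketch below. The selection criteria are assumptions: it picks the lowest RMSE per horizon and ranks models overall by mean RMSE across horizons, which may differ from the rule the project actually uses. The function name pick_champions and the metric values are made up for illustration and are not real results.

```python
def pick_champions(metrics):
    """Return (champion per horizon, overall ranking) from {model: {horizon: rmse}}.

    Assumed criteria: lowest RMSE wins a horizon; overall rank uses mean RMSE.
    """
    horizons = sorted({h for per_model in metrics.values() for h in per_model})
    champions = {h: min(metrics, key=lambda m: metrics[m][h]) for h in horizons}
    overall = sorted(
        metrics, key=lambda m: sum(metrics[m].values()) / len(metrics[m])
    )
    return champions, overall

# Illustrative numbers only.
example_metrics = {
    "ridge": {"day+1": 20.0, "day+2": 25.0, "day+3": 33.0},
    "random_forest": {"day+1": 18.0, "day+2": 26.0, "day+3": 30.0},
    "gradient_boosting": {"day+1": 19.0, "day+2": 24.0, "day+3": 32.0},
}
champions, overall = pick_champions(example_metrics)
```

In this toy input, different models win different horizons, which is why a single hardcoded winner would leave accuracy on the table.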

4. Prediction service

FastAPI reads the latest registry artifact from MongoDB GridFS and serves predictions to the frontend. Users can use the champion setup or force one model for comparison.

Model strategy

Multiple models, no hardcoded winner.

The training code experiments with Ridge Regression, Random Forest, Gradient Boosting, and an MLP neural network. Each model is scored with RMSE, MAE, and R². The registry stores the metrics, the trained artifact, the horizon champions, and the overall leaderboard.
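For reference, the three scores can be computed from their standard definitions as in the sketch below. It uses only the standard library; the function name regression_metrics is hypothetical, and the project may well rely on a library implementation instead.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² from their standard definitions.

    Assumes equal-length, non-empty inputs and a non-constant y_true
    (otherwise R² is undefined because the total sum of squares is zero).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Illustrative values only.
scores = regression_metrics([100, 150, 200], [110, 140, 200])
```

RMSE penalizes large misses more heavily than MAE, while R² reports how much of the variance in observed AQI the model explains, so the three together give a fuller picture than any one alone.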

Validation

Time-aware split

The pipeline avoids random shuffling for evaluation, because AQI is a time-series problem. Newer records are held out for testing so the metrics represent future forecasting behavior more honestly.
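A chronological holdout of this kind can be sketched as follows. The function name chronological_split, the timestamp field, and the 20% test fraction are assumptions for illustration; the actual split ratio is not stated on this page.

```python
from datetime import datetime, timedelta

def chronological_split(rows, test_fraction=0.2):
    """Sort rows by time and hold out the newest slice for testing (no shuffling)."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

# Illustrative input: ten hourly rows supplied in reverse order.
start = datetime(2024, 1, 1, 0)
rows = [{"timestamp": start + timedelta(hours=h), "aqi": 100 + h} for h in range(10)]
train, test = chronological_split(list(reversed(rows)))
```

Every test row is newer than every training row, so the model is never scored on data that precedes what it was trained on.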

Safety

System Integrity

The frontend shows useful registry evidence and hides raw artifact IDs by default, keeping the public experience readable while the database still stores full technical proof.

Feature engineering

Signals the model uses.

The feature set combines pollutant concentration, temporal context, historical AQI memory, and rolling pollution behavior.

Pollutants: PM10, PM2.5, CO, NO2, SO2, O3
Time features: day and month
Trend features: AQI delta and AQI lags
Rolling features: short-window pollutant means
Quality checks: duplicates, nulls, out-of-range AQI, leakage hints
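The first three quality checks could look something like the sketch below. The function name quality_report, the field names, and the 0 to 500 AQI bounds are assumptions (the valid range depends on which AQI scale the data source uses). Leakage detection is left out because this page does not describe how it is done.

```python
def quality_report(rows, aqi_min=0, aqi_max=500):
    """Count duplicate timestamps, rows with nulls, and out-of-range AQI values.

    Hypothetical helper: bounds and field names are illustrative assumptions.
    """
    seen = set()
    report = {"duplicates": 0, "nulls": 0, "out_of_range": 0}
    for row in rows:
        ts = row["timestamp"]
        if ts in seen:
            report["duplicates"] += 1
        seen.add(ts)
        if any(value is None for value in row.values()):
            report["nulls"] += 1
        aqi = row.get("aqi")
        if aqi is not None and not (aqi_min <= aqi <= aqi_max):
            report["out_of_range"] += 1
    return report

# Illustrative rows: one duplicate, one null AQI, one implausible AQI.
sample = [
    {"timestamp": "2024-01-01T00:00", "aqi": 120, "pm25": 35.0},
    {"timestamp": "2024-01-01T00:00", "aqi": 120, "pm25": 35.0},
    {"timestamp": "2024-01-01T01:00", "aqi": None, "pm25": 40.0},
    {"timestamp": "2024-01-01T02:00", "aqi": 900, "pm25": 50.0},
]
report = quality_report(sample)
```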