Methodology
This page documents the system's architecture, data pipeline, and modeling methodology. It covers the full machine-learning workflow, from hourly data collection to the predictions served on the dashboard.
Workflow
GitHub Actions runs the feature workflow every hour. It fetches Islamabad weather and pollutant data, builds lag/rolling/time features, deduplicates records, and writes them into MongoDB Atlas.
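The deduplication step can be pictured with a small sketch. It assumes each record carries a `city` and a `timestamp` field and that a re-fetched hour should replace the earlier copy; those field names, the `deduplicate` helper, and the sample values are illustrative assumptions, not the project's actual schema or data.

```python
from datetime import datetime

def deduplicate(records):
    """Keep one record per (city, timestamp), preferring the latest copy seen."""
    latest = {}
    for rec in records:
        key = (rec["city"], rec["timestamp"])
        # A later entry with the same key overwrites the earlier one.
        latest[key] = rec
    # Return rows in chronological order so lag features line up downstream.
    return sorted(latest.values(), key=lambda r: r["timestamp"])

# Illustrative values only (hypothetical field names).
raw = [
    {"city": "Islamabad", "timestamp": datetime(2024, 1, 1, 10), "pm2_5": 80.0},
    {"city": "Islamabad", "timestamp": datetime(2024, 1, 1, 9), "pm2_5": 75.0},
    {"city": "Islamabad", "timestamp": datetime(2024, 1, 1, 10), "pm2_5": 82.0},  # re-fetch of 10:00
]
clean = deduplicate(raw)
```

Keying on the timestamp makes an hourly job safe to re-run: running it twice for the same hour leaves one row rather than two.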
The training workflow runs daily with a backup trigger. It reads historical features from the cloud feature store, trains multiple models, evaluates them with time-aware validation, and stores metrics for every horizon.
Models compete separately for Day +1, Day +2, and Day +3. The dashboard also exposes an overall ranking so users can understand both horizon-level and global performance.
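One way to derive horizon champions and an overall ranking from stored metrics is sketched below. It assumes the champion is the model with the lowest RMSE at each horizon and that the overall ranking sorts by mean RMSE across horizons; the actual selection rule is not stated here, and the model keys and scores are placeholders rather than real results.

```python
from statistics import mean

# Placeholder RMSE values per model and horizon (not real results).
scores = {
    "ridge": {"day+1": 12.0, "day+2": 15.0, "day+3": 19.0},
    "random_forest": {"day+1": 10.0, "day+2": 16.0, "day+3": 18.0},
    "gradient_boosting": {"day+1": 11.0, "day+2": 14.0, "day+3": 20.0},
}

def horizon_champions(scores):
    """Pick the lowest-RMSE model for each horizon (assumed rule)."""
    horizons = next(iter(scores.values())).keys()
    return {h: min(scores, key=lambda m: scores[m][h]) for h in horizons}

def overall_ranking(scores):
    """Order models by mean RMSE across all horizons, best first (assumed rule)."""
    return sorted(scores, key=lambda m: mean(scores[m].values()))

champions = horizon_champions(scores)
ranking = overall_ranking(scores)
```

With these placeholder numbers, different models win different horizons, which is why a per-horizon view and a global view can disagree.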
FastAPI reads the latest registry artifact from MongoDB GridFS and serves predictions to the frontend. Users can use the champion setup or force one model for comparison.
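The choice between the champion setup and a forced model can be sketched as a small lookup. The registry layout (`models` and `champions` keys) and the `resolve_model` helper are hypothetical; the sketch leaves out the FastAPI route and the GridFS read and shows only the selection logic.

```python
def resolve_model(registry, horizon, forced_model=None):
    """Return the model name to use for a horizon.

    If forced_model is given, it overrides the champion for comparison;
    otherwise the stored champion for that horizon is used.
    """
    if forced_model is not None:
        if forced_model not in registry["models"]:
            raise KeyError(f"unknown model: {forced_model}")
        return forced_model
    return registry["champions"][horizon]

# Hypothetical registry contents for illustration.
registry = {
    "models": ["ridge", "random_forest", "gradient_boosting", "mlp"],
    "champions": {"day+1": "random_forest", "day+2": "gradient_boosting", "day+3": "ridge"},
}
default_choice = resolve_model(registry, "day+2")
forced_choice = resolve_model(registry, "day+2", forced_model="mlp")
```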
Model strategy
The training code experiments with Ridge Regression, Random Forest, Gradient Boosting, and an MLP neural network. Each model is scored with RMSE, MAE, and R². The registry stores the metrics, the trained artifact, the horizon champions, and the overall leaderboard.
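For reference, the three scores follow their standard definitions. The sketch below computes them in plain Python; the function names and sample values are illustrative, and the training code may well rely on a library implementation instead.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in AQI units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values only.
y_true = [100.0, 120.0, 140.0, 160.0]
y_pred = [110.0, 120.0, 130.0, 160.0]
scores = {"rmse": rmse(y_true, y_pred), "mae": mae(y_true, y_pred), "r2": r2(y_true, y_pred)}
```

Lower is better for RMSE and MAE; higher is better for R², with 1.0 meaning a perfect fit.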
Validation
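The pipeline avoids random shuffling during evaluation because AQI forecasting is a time-series problem: shuffled splits would let a model train on records that come after the ones it is tested on. Instead, the newest records are held out for testing, so the metrics better reflect how the model will perform on future data.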
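A chronological holdout of this kind can be sketched as follows. The `chronological_split` helper, the 20% test fraction, and the row layout are assumptions for illustration; the pipeline's actual split sizes are not stated here.

```python
from datetime import datetime, timedelta

def chronological_split(rows, test_fraction=0.2):
    """Sort by time and hold out the newest rows for testing (no shuffling)."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

# Ten hourly rows, deliberately supplied out of order; values are illustrative.
rows = [
    {"timestamp": datetime(2024, 1, 1) + timedelta(hours=h), "aqi": 100 + h}
    for h in reversed(range(10))
]
train, test = chronological_split(rows)
```

Every training row is older than every test row, so the evaluation never scores a model on data that precedes what it was trained on.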
Safety
The frontend shows the registry evidence that matters to users and hides raw artifact IDs by default. This keeps the public view readable while the database retains the full technical record.
Feature engineering
The feature set combines pollutant concentration, temporal context, historical AQI memory, and rolling pollution behavior.
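The sketch below illustrates the temporal, lag, and rolling parts of such a feature set on an hourly AQI series. The column names, the lag choices (1 and 2 hours), and the 3-hour window are assumptions for illustration, not the pipeline's actual configuration.

```python
from datetime import datetime, timedelta

def build_features(series, lags=(1, 2), window=3):
    """Build feature rows from a time-sorted list of (timestamp, aqi) tuples."""
    rows = []
    # Skip the first rows, which lack enough history for lags and the window.
    start = max(max(lags), window)
    for i in range(start, len(series)):
        ts, aqi = series[i]
        row = {"timestamp": ts, "aqi": aqi, "hour": ts.hour, "weekday": ts.weekday()}
        for lag in lags:
            row[f"aqi_lag_{lag}"] = series[i - lag][1]
        # The rolling mean uses only values before the current hour to avoid leakage.
        past = [value for _, value in series[i - window:i]]
        row[f"aqi_roll_mean_{window}"] = sum(past) / window
        rows.append(row)
    return rows

# Illustrative series: AQI rising by 10 each hour.
series = [(datetime(2024, 1, 1) + timedelta(hours=h), 100.0 + 10 * h) for h in range(5)]
features = build_features(series)
```

Computing the rolling statistic from past values only keeps the current target out of its own inputs, which matters for the same reason the validation split is chronological.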