Building a production MLOps pipeline on Azure
A deep dive into architecting continuous training pipelines with Azure, GitLab CI, and Terraform — lessons learned from real-world deployments.
When I joined the team at knecon, one of our first major challenges was building a continuous training pipeline that could handle the demands of enterprise clients while maintaining strict GDPR compliance. This post walks through the architecture we developed and the lessons we learned along the way.
The Challenge
Our clients needed to retrain models frequently as new data came in, but the existing process was manual and error-prone. Data scientists would run training jobs locally, manually upload models to a shared drive, and coordinate deployments over Slack. This worked for proofs of concept but was unsustainable in production.
Architecture Overview
We settled on an architecture built around Azure Machine Learning, with GitLab CI orchestrating the pipeline and Terraform managing infrastructure as code. The key components included automated data validation, reproducible training environments, model versioning with MLflow, and blue-green deployments via KServe.
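To make the automated data validation step more concrete, here is a minimal sketch of a validation gate that could run before training starts. It is not our production code: the `ColumnRule` class, the `validate_rows` function, and the column names are hypothetical, and it uses only the Python standard library rather than a dedicated validation framework.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ColumnRule:
    """Hypothetical rule for one column of the incoming training data."""
    name: str
    dtype: type
    required: bool = True
    min_value: Optional[float] = None  # only meaningful for numeric columns
    max_value: Optional[float] = None  # only meaningful for numeric columns


def validate_rows(rows: list[dict[str, Any]], rules: list[ColumnRule]) -> list[str]:
    """Return a list of human-readable violations; an empty list means the batch passed."""
    errors: list[str] = []
    for index, row in enumerate(rows):
        for rule in rules:
            value = row.get(rule.name)
            if value is None:
                if rule.required:
                    errors.append(f"row {index}: missing '{rule.name}'")
                continue
            if not isinstance(value, rule.dtype):
                errors.append(
                    f"row {index}: '{rule.name}' expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
                continue
            if rule.min_value is not None and value < rule.min_value:
                errors.append(f"row {index}: '{rule.name}'={value} below {rule.min_value}")
            if rule.max_value is not None and value > rule.max_value:
                errors.append(f"row {index}: '{rule.name}'={value} above {rule.max_value}")
    return errors


if __name__ == "__main__":
    # Illustrative schema and data, not taken from a real client dataset.
    rules = [
        ColumnRule("age", int, min_value=0, max_value=120),
        ColumnRule("country", str),
    ]
    batch = [
        {"age": 34, "country": "DE"},
        {"age": -1, "country": "DE"},
        {"country": "FR"},
    ]
    for problem in validate_rows(batch, rules):
        print(problem)
```

In a CI job, a script like this can exit with a non-zero status whenever the returned list is non-empty, which fails the job and keeps the training stage from running on a bad batch.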
Key Lessons
- Start with monitoring: We spent too long building features before we had proper observability. Adding Prometheus metrics and Grafana dashboards early would have saved us weeks of debugging.
- Treat ML pipelines like software: Version everything, including code, data, models, and configurations. GitOps principles apply to ML just as much as they do to traditional software.
- Plan for compliance from day one: Retrofitting GDPR compliance was painful. Build audit trails, data lineage tracking, and access controls into your initial design.
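The second and third lessons can be illustrated together: a training run becomes both reproducible and auditable once you record exactly which code, data, and configuration produced it. The sketch below is one way to do that with the standard library only. The function names, record fields, and log format are assumptions made for illustration, not the schema we used.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Optional


def sha256_hex(payload: bytes) -> str:
    """Content hash used as a version identifier for data and configuration."""
    return hashlib.sha256(payload).hexdigest()


def build_run_record(
    code_rev: str,
    data: bytes,
    config: dict[str, Any],
    actor: str,
    timestamp: Optional[str] = None,
) -> dict[str, str]:
    """Build a lineage record for one training run (hypothetical schema).

    The config is serialized with sorted keys so that key order does not
    change its hash. `actor` should be a pseudonymous ID, not personal data.
    """
    config_bytes = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "code_rev": code_rev,
        "data_sha256": sha256_hex(data),
        "config_sha256": sha256_hex(config_bytes),
        "actor": actor,
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
    }


def append_audit_line(path: str, record: dict[str, str]) -> None:
    """Append the record as one JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    record = build_run_record(
        code_rev="abc1234",
        data=b"id,label\n1,0\n2,1\n",
        config={"learning_rate": 0.01, "epochs": 5},
        actor="svc-training",
    )
    append_audit_line("audit_log.jsonl", record)
    print(record)
```

Because the hashes depend only on content, two runs with the same code revision, data, and configuration produce matching identifiers, and any difference in the inputs shows up as a different hash in the log.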
Results
After six months of iteration, we had cut time-to-market for new models by 50-60% and reduced deployment failures by over 80%. More importantly, our data scientists could focus on model development instead of wrestling with infrastructure.
In a future post, I'll dive deeper into our model serving setup with MLflow and KServe, including the CI/CD patterns that enabled 5x faster deployments.