Machine Learning Ops Engineer (6+ years)
Zorba AI
Bengaluru, Karnataka, India
Senior
Job Description
Senior Data & ML Operations Engineer responsible for managing end-to-end ML pipelines.
Responsibilities
- Monitor end-to-end data pipeline execution and ensure successful daily operations.
- Identify, troubleshoot, and resolve pipeline failures, performance bottlenecks, and production issues.
- Execute reruns and recovery procedures to minimize downtime and maintain SLA compliance.
- Collaborate with cross-functional teams to resolve dependencies, blockers, and integration issues.
- Implement preventive health checks, monitoring frameworks, and robust logging mechanisms.
- Design and maintain dashboards for data validation, reconciliation, and quality monitoring.
- Perform data quality assessments and ensure integrity, consistency, and accuracy of pipeline outputs.
- Develop automated validation frameworks and quality checks across data workflows.
- Build alerts and notification systems for pipeline failures, data anomalies, and operational issues.
- Monitor model performance using statistical and business metrics.
- Detect and analyze data drift, feature drift, and concept drift across production models.
- Support deployment, monitoring, maintenance, and lifecycle management of ML models.
- Implement model explainability techniques and performance reporting frameworks.
- Develop intelligent agent-based solutions for automated monitoring, troubleshooting, and debugging.
- Leverage Generative AI technologies for operational insights, issue summarization, and root cause analysis.
- Automate repetitive operational tasks to improve platform reliability and efficiency.
- Design, enhance, and maintain CI/CD pipelines for data and ML workloads.
- Implement secure authentication mechanisms, including data-based authentication workflows.
- Build and optimize deployment pipelines, release processes, and infrastructure automation.
- Support DevOps best practices for version control, testing, deployment, and monitoring.
- Communicate project status, risks, incidents, and resolutions effectively to stakeholders.
- Ensure timely delivery of operational and project commitments.
- Participate in incident management, root cause analysis, and continuous improvement initiatives.
Qualifications
- Python
- Databricks
- SQL
- Data Engineering & Data Processing
- Machine Learning Engineering
- MLOps
- CI/CD Pipeline Development
- Monitoring & Production Support
- Data Validation & Data Quality Management
- Logging & Observability Tools
- Dashboard Development & Reporting
- Statistical Analysis & Model Monitoring
- Model Explainability Techniques
- Generative AI Applications
- Automation & Agent-Based Systems
- Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD tools
- MLflow
- Apache Spark / PySpark
- Cloud Platforms (Azure, AWS, or GCP)
- Monitoring tools such as Datadog, Grafana, Prometheus, or equivalent
- Experience with LLMs and GenAI frameworks
- Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related field.
- 5+ years of experience in Data Engineering, MLOps, Production Support, or ML Platform Engineering.
- Proven experience managing production-scale data and machine learning systems.
- Strong analytical, troubleshooting, and communication skills.