Platform Engineer – OpenShift+ AI-ML SRE | 4+ years
Cisco
Bengaluru, Karnataka, India
Senior | AI/ML | SRE
Job Description
Design and develop scalable Kubernetes infrastructure with a focus on AI technologies.
Responsibilities
- The ideal candidate will have strong hands-on expertise in Red Hat OpenShift, proficiency in Golang and/or Python, and a passion for delivering highly reliable, scalable, and secure infrastructure. Hands-on experience with AI technologies such as Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU frameworks is also expected.
- Design, deploy, administer, and optimize highly available Red Hat OpenShift platforms.
- Implement and drive Site Reliability Engineering (SRE) practices to ensure platform reliability, scalability, and operational excellence.
- Develop automation tools, operators, and platform services using Golang and/or Python.
- Manage cluster lifecycle activities including upgrades, patching, capacity planning, and performance tuning.
- Build and maintain CI/CD pipelines and Infrastructure as Code (IaC) solutions.
- Implement and maintain observability solutions including logging, metrics, tracing, and alerting.
- Monitor platform health and proactively identify and resolve reliability and performance issues.
- Resolve production incidents, perform root cause analysis (RCA), and drive preventive actions.
- Collaborate closely with application and DevOps teams to improve deployment processes and platform adoption.
- Ensure platform security, compliance, and adherence to organizational standards and procedures.
- Participate in a 16×5 on-call support rotation, providing timely response and resolution for production incidents and ensuring service availability.
- Continuously evaluate and adopt emerging technologies to enhance platform capabilities and operational efficiency.
- Collaborate with global cross-functional teams across regions to support platform initiatives, drive operational excellence, and ensure seamless delivery of services and solutions.
- Support the GPU-as-a-Service platform offering and provide client support for hosting GPU-powered AI/ML workloads.
Qualifications
- 4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or related roles.
- Strong hands-on experience with Red Hat OpenShift administration, operations, and troubleshooting.
- Proficiency in Golang and/or Python for automation and platform engineering.
- Strong understanding of Linux systems, networking, and distributed systems concepts.
- Proven experience with observability tools such as Prometheus, Grafana, ELK, Loki, Jaeger, and OpenTelemetry.
- Strong troubleshooting, debugging, and incident management capabilities.
- Hands-on experience with AI/ML platforms, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU architectures.
- Ability to support and participate in 16×5 on-call rotations for critical production environments.
- Knowledge of container and platform security standards.
- Reliability-first and automation-driven attitude.
- Strong analytical and problem-solving skills.
- Ability to work effectively in a fast-paced production environment.
- Excellent communication and partnership skills.
- Ownership, accountability, and a customer-focused approach.
Nice to have
- Familiarity with public cloud platforms (AWS, Azure, or GCP).
- Familiarity with GitOps methodologies and tools.
Benefits
- Collaborative environment
- Professional development opportunities