Platform Engineer – OpenShift+ AI-ML SRE | 4+ years

Cisco

Bengaluru, Karnataka, India

SENIOR

AI, ML, SRE

Job Description

Design and develop scalable Kubernetes infrastructure with a focus on AI technologies.

Responsibilities

The ideal candidate will have strong hands-on expertise in Red Hat OpenShift, proficiency in Golang and/or Python, and a passion for delivering highly reliable, scalable, and secure infrastructure, along with hands-on experience with AI technologies such as Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU frameworks.
Design, deploy, administer, and optimize highly available Red Hat OpenShift platforms.
Implement and drive Site Reliability Engineering (SRE) practices to ensure platform reliability, scalability, and operational excellence.
Develop automation tools, operators, and platform services using Golang and/or Python.
Manage cluster lifecycle activities including upgrades, patching, capacity planning, and performance tuning.
Build and maintain CI/CD pipelines and Infrastructure as Code (IaC) solutions.
Implement and maintain observability solutions including logging, metrics, tracing, and alerting.
Monitor platform health and proactively identify and resolve reliability and performance issues.
Resolve production incidents, perform root cause analysis (RCA), and drive preventive actions.
Collaborate closely with application and DevOps teams to improve deployment processes and platform adoption.
Ensure platform security, compliance, and adherence to organizational standards and procedures.
Participate in a 16×5 on-call support rotation, providing timely response and resolution for production incidents and ensuring service availability.
Continuously evaluate and adopt emerging technologies to enhance platform capabilities and operational efficiency.
Collaborate with global cross-functional teams across regions to support platform initiatives, drive operational excellence, and ensure seamless delivery of services and solutions.
Support the GPU-as-a-Service platform offering and provide client support for hosting GPU-powered AI/ML workloads.

Qualifications

4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or related roles.
Strong hands-on experience with Red Hat OpenShift administration, operations, and troubleshooting.
Proficiency in Golang and/or Python for automation and platform engineering.
Strong understanding of Linux systems, networking, and distributed systems concepts.
Proven experience with observability tools such as Prometheus, Grafana, ELK, Loki, Jaeger, and OpenTelemetry.
Strong troubleshooting, debugging, and incident management capabilities.
Hands-on experience with AI/ML platforms, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU architectures.
Ability to support and participate in 16×5 on-call rotations for critical production environments.
Knowledge of container and platform security standards.
Reliability-first and automation-driven attitude.
Strong analytical and problem-solving skills.
Ability to work effectively in a fast-paced production environment.
Excellent communication and partnership skills.
Ownership, accountability, and a customer-focused approach.

Nice to have

Familiarity with public cloud platforms (AWS, Azure, or GCP).
Familiarity with GitOps methodologies and tools.

Benefits

Collaborative environment
Professional development opportunities
