Platform Engineer – OpenShift+ AI-ML SRE | 4+ years

Cisco

Bengaluru, Karnataka, India

SENIOR | AIML | SRE

Job Description

Design and develop scalable Kubernetes infrastructure with a focus on AI technologies.

Responsibilities

  • The ideal candidate will have strong hands-on expertise in Red Hat OpenShift, proficiency in Golang and/or Python, and a passion for delivering highly reliable, scalable, and secure infrastructure, along with hands-on experience with AI technologies such as Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU frameworks.
  • Design, deploy, administer, and optimize highly available Red Hat OpenShift platforms.
  • Implement and drive Site Reliability Engineering (SRE) practices to ensure platform reliability, scalability, and operational excellence.
  • Develop automation tools, operators, and platform services using Golang and/or Python.
  • Manage cluster lifecycle activities including upgrades, patching, capacity planning, and performance tuning.
  • Build and maintain CI/CD pipelines and Infrastructure as Code (IaC) solutions.
  • Implement and maintain observability solutions including logging, metrics, tracing, and alerting.
  • Monitor platform health and proactively identify and resolve reliability and performance issues.
  • Resolve production incidents, perform root cause analysis (RCA), and drive preventive actions.
  • Collaborate closely with application and DevOps teams to improve deployment processes and platform adoption.
  • Ensure platform security, compliance, and adherence to organizational standards and procedures.
  • Participate in a 16×5 on-call support rotation, providing timely response and resolution for production incidents and ensuring service availability.
  • Continuously evaluate and adopt emerging technologies to enhance platform capabilities and operational efficiency.
  • Collaborate with global cross-functional teams across regions to support platform initiatives, drive operational excellence, and ensure seamless delivery of services and solutions.
  • Support the GPU-as-a-Service platform offering and provide client support for hosting GPU-powered AI/ML workloads.

Qualifications

  • 4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or related roles.
  • Strong hands-on experience with Red Hat OpenShift administration, operations, and troubleshooting.
  • Proficiency in Golang and/or Python for automation and platform engineering.
  • Strong understanding of Linux systems, networking, and distributed systems concepts.
  • Proven experience with observability tools such as Prometheus, Grafana, ELK, Loki, Jaeger, and OpenTelemetry.
  • Strong troubleshooting, debugging, and incident management capabilities.
  • Hands-on experience with AI/ML platforms, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU architectures.
  • Ability to support and participate in 16×5 on-call rotations for critical production environments.
  • Knowledge of container and platform security standards.
  • Reliability-first and automation-driven attitude.
  • Strong analytical and problem-solving skills.
  • Ability to work effectively in a fast-paced production environment.
  • Excellent communication and partnership skills.
  • Ownership, accountability, and a customer-focused approach.

Nice to have

  • Familiarity with public cloud platforms (AWS, Azure, or GCP).
  • Familiarity with GitOps methodologies and tools.

Benefits

  • Collaborative environment
  • Professional development opportunities
