Site Reliability Engineer (Mid/Senior)

Razer Inc.

This job is for a Site Reliability Engineer at Razer, where you'll help build reliable systems for AI products worldwide. You might like this job because you’ll work in a global team and grow your skills in a gamer-centric environment!

Undisclosed

Singapore, Central

Full-Time

Job Description

Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a team located across 5 continents. Razer is also a great place to work, providing you the unique, gamer-centric #LifeAtRazer experience that will put you on a path of accelerated growth, both personally and professionally.

Job Responsibilities:

We are looking for Site Reliability Engineers (SRE) to join our AI Software team. In this role, you will ensure the reliability, performance, scalability, and operational excellence of AI products, model-serving infrastructure, and backend API systems. You’ll work closely with software engineers, AI teams and release teams to automate operations, enhance observability, and streamline deployments in a cloud-scale environment. This role is ideal for someone who enjoys building resilient systems, solving complex infrastructure problems, and supporting AI workloads in production.

Essential Duties and Responsibilities

Administer, monitor, and manage cloud-scale production environments for AI model APIs, backend services, and high-traffic web systems serving global users.

Design and implement fault-tolerant, autoscaling cloud architectures tailored for AI inference workloads, including GPU-based environments and software products.

Build automated self-recovery systems to ensure high availability, rapid failover, and cost-efficient resource usage for all software products.

Manage and monitor AI model-serving platforms, inference engines, vector databases, data pipelines, and software applications.

Ensure reliability and uptime for experimental and production AI software environments.

Implement and maintain comprehensive monitoring, logging, and alerting for all AI and backend services.

Reduce MTTR through actionable alerts, runbooks, and automated diagnostics.

Automate infrastructure using IaC (Terraform/CloudFormation) and configuration management.

Improve release workflows and integrate with QA for smooth handoff to Release Candidate testing.

Work closely with software engineering, ML engineering, and release management to enhance operational procedures, deployment processes, and incident response workflows.

Participate in on-call rotations, incident reviews, and continuous improvement initiatives.

Pre-Requisites:

Qualifications

4+ years of relevant experience in SRE, DevOps, infrastructure engineering, or cloud operations

Experience operating production services with significant availability or scaling demands.

Strong knowledge of web technologies such as HTTP, REST, SSL, load balancers, and web proxies (NGINX)

Comfortable with Linux and Docker administration

Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), container orchestration (AWS ECS or K8s), version control (Git), and databases (MySQL, NoSQL)

Strong ability to code and script (preferably in Bash and Python)

Ability to use or quickly pick up a wide variety of open source technologies and automation tools

Understanding of GPU-based workloads and resource scheduling.

Familiarity with vector databases, embeddings, and inference pipelines

Comfort with frequent, incremental code testing and deployment

Good analytical skills to debug deployment problems without relying on developers

Deep hands-on technical expertise and problem-solving skills

Ability to work in a collaborative, technically challenging environment with rapidly changing requirements.

Education & Experience

Bachelor’s or Master’s degree in computer science, AI, or a similar discipline from an accredited institution

Travel Requirements

The role is based in the Singapore office and may require up to one trip per year.

Are you game?

Company Benefits

Career advancement

With fifteen offices and three R&D labs worldwide, be part of a global team that transcends time zones and geographical boundaries.

Transparency

You get to enjoy working in an environment that values transparency and collaborative effort.

Global exposure

You'll be at the forefront of the most exciting industry in the world—video games, bringing gamers closer to the games they love.

Additional Info

Experience Level

0 - 10 Years of Experience

Job Specialisation

Pre-Sales / IT Business Analyst / Business Intelligence

Software Development & QA / Testing

System & IT Helpdesk / Database Administrator

Company Profile

Razer Inc.

Razer™ is the world’s leading lifestyle brand for gamers. The triple-headed snake trademark of Razer is one of the most recognized logos in the global gaming and esports communities. With a fan base that spans every continent, the company has designed and built the world’s largest gamer-focused ecosystem of hardware, software and services. Founded in 2005, Razer is dual headquartered in Irvine (California) and...