REMOTE

Site Reliability Engineer

Responsibilities

  • Design and maintain Kubernetes clusters across multiple environments (development, staging, production)
  • Build automation for cluster deployment, configuration, and management
  • Monitor and troubleshoot clusters to ensure high availability and optimal performance
  • Implement security best practices for Kubernetes and underlying infrastructure
  • Participate in incident response and work to reduce Mean Time To Recovery (MTTR)
  • Enhance the reliability and scalability of our Kubernetes infrastructure
  • Manage CI/CD pipelines and DevOps tooling
  • Collaborate with development teams on deployment strategies and best practices

Requirements

  • Deep Kubernetes expertise - Certified Kubernetes Administrator (CKA) certification preferred
  • Infrastructure as Code - Experience with 2+ IaC tools (Terraform, Pulumi, etc.)
  • Monitoring & Observability - Proficiency with Prometheus, Grafana, and related tools
  • Cloud Platforms - Hands-on experience with AWS, Azure, or GCP
  • CI/CD - Knowledge of GitHub Actions, GitLab CI, or Azure DevOps
  • Networking & Security - Understanding of network fundamentals and security best practices
  • Problem-solving - Strong analytical and troubleshooting abilities
  • Communication - Fluent English for remote asynchronous work
  • Self-motivated - Ability to work independently with an agile approach

Nice-to-haves

  • Experience with GitOps tools (Flux, ArgoCD)
  • Go programming knowledge or willingness to learn
  • Active open-source contributions
  • Experience developing Kubernetes operators or controllers

Benefits

  • 100% remote work with flexible hours
  • Work with cutting-edge cloud-native technologies
  • Contribute to open-source projects
  • Collaborative, distributed team environment
  • Opportunity to shape the future of Kubernetes tooling