REMOTE
Site Reliability Engineer
Responsibilities
- Design and maintain Kubernetes clusters across multiple environments (development, staging, production)
- Build automation for cluster deployment, configuration, and management
- Monitor and troubleshoot clusters to ensure high availability and optimal performance
- Implement security best practices for Kubernetes and underlying infrastructure
- Participate in incident response and work to reduce mean time to recovery (MTTR)
- Enhance the reliability and scalability of our Kubernetes infrastructure
- Manage CI/CD pipelines and DevOps tooling
- Collaborate with development teams on deployment strategies and best practices
Requirements
- Deep Kubernetes expertise - CKA certification preferred
- Infrastructure as Code - Experience with at least two IaC tools (such as Terraform or Pulumi)
- Monitoring & Observability - Proficiency with Prometheus, Grafana, and related tools
- Cloud Platforms - Hands-on experience with AWS, Azure, or GCP
- CI/CD - Knowledge of GitHub Actions, GitLab CI, or Azure DevOps
- Networking & Security - Understanding of network fundamentals and security best practices
- Problem-solving - Strong analytical and troubleshooting abilities
- Communication - Fluent English for remote, asynchronous work
- Self-motivation - Ability to work independently in an agile environment
Nice-to-haves
- Experience with GitOps tools (Flux, Argo CD)
- Go programming knowledge or willingness to learn
- Active open-source contributions
- Experience developing Kubernetes operators or controllers
Benefits
- 100% remote work with flexible hours
- Work with cutting-edge cloud-native technologies
- Contribute to open-source projects
- Collaborative, distributed team environment
- Opportunity to shape the future of Kubernetes tooling