Job Description: Site Reliability Engineer
For this position, we’re looking for talented and experienced engineers who have a passion for infrastructure and automation.
As a Site Reliability Engineer (SRE), you will work within the development team to combine software and systems engineering and run large-scale distributed systems. You will also maintain the capacity and performance of the client's systems.
Responsibilities:
· Taking part in architecture-level discussions, design, planning, and implementation.
· Researching to ensure what we are building is always the best path forward.
· Documenting each project to facilitate integration for users.
· Driving proofs of concept and minimum viable products for demonstration.
· Designing and delivering Infrastructure as Code.
· Developing and implementing automation for routine tasks, including alerting, system monitoring, and response mechanisms.
· Developing and maintaining dashboards for monitoring and observability.
· Supporting multiple services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
· Managing incidents and participating in the on-call rotation.
Education and Experience:
· To succeed in this role, candidates must have strong foundational knowledge of, and demonstrated proficiency in, Linux/Unix (Talos).
· At least five years of experience as an SRE or in a similar role, such as DevOps or Software Engineer.
· At least two years of programming experience in a conventional programming language.
· Kubernetes knowledge is required. Experience with bare metal / non-managed Kubernetes would be a plus.
· Experience in Python and other scripting languages.
· Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, Helm, Puppet, or Chef).
· Networking and cloud computing platform experience.
· Proficiency in scripting and programming languages (e.g., Bash, Python, Go, Node, Java, or similar).
· Familiarity with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK Stack, or similar).
· Experience with Grafana Mimir.
· Familiarity with CI/CD tools and SDLC practices.
· You have strong problem-solving skills and excellent communication skills.
· You can work independently as well as collaboratively in a remote team environment.
You are friendly, collaborative, humble, honest, and always s