Job Description: Site Reliability Engineer
For this position, we’re looking for talented and experienced engineers who have a passion for infrastructure and automation.
 
As a Site Reliability Engineer (SRE), you will work within the development team, combining software and systems engineering to run large-scale distributed systems. You will also maintain the capacity and performance of the client's systems.
 
Responsibilities:
·      Taking part in architecture-level discussions, design, planning, and implementation.
·      Researching to ensure what we are building is always the best path forward.
·      Documenting each project to facilitate integration for users.
·      Driving proofs of concept and minimum viable products for demonstration.
·      Designing and delivering Infrastructure as Code.
·      Developing and implementing automation for routine tasks, including alerting, system monitoring, and response mechanisms.
·      Developing and maintaining dashboards for monitoring and observability.
·      Supporting multiple services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
·      Managing incidents and participating in the on-call rotation.
Education and Experience:
·      To succeed in this role, candidates must have strong foundational knowledge of, and demonstrated proficiency in, Linux/Unix (Talos).
·      At least five years of SRE or similar experience, for example as a DevOps or Software Engineer.
·      At least two years of programming experience in a conventional programming language.
·      Kubernetes knowledge is required. Experience with bare-metal / non-managed Kubernetes would be a plus.
·      Experience in Python and other scripting languages.
·      Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, Helm, Puppet, or Chef).
·      Networking and cloud computing platform experience.
·      Proficiency in scripting and programming languages (e.g., Bash, Python, Go, Node, Java, or similar).
·      Familiarity with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK Stack, or similar).
·      Experience with Grafana Mimir.
·      Familiarity with CI/CD tools and SDLC practices. 
·      Strong problem-solving and communication skills.
·      Ability to work independently as well as collaboratively in a remote team environment.
You are friendly, collaborative, humble, honest, and always s