Job Description: Site Reliability Engineer

For this position, we’re looking for talented and experienced engineers who have a passion for infrastructure and automation.

As a Site Reliability Engineer (SRE), you will work within the development team, combining software and systems engineering to run large-scale distributed systems. You will also maintain the capacity and performance of the client's systems.

Responsibilities:

· Taking part in architecture-level discussions, design, planning, and implementation.

· Researching to ensure that what we are building is always the best path forward.

· Documenting each project to facilitate integration for users.

· Driving proofs of concept and minimum viable products for demonstration.

· Designing and delivering Infrastructure as Code.

· Developing and implementing automation for routine tasks, including alerting, system monitoring, and response mechanisms.

· Developing and maintaining dashboards for monitoring and observability.

· Supporting multiple services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.

· Managing incidents and participating in the on-call rotation.

Education and Experience:

· To succeed in this role, candidates must have strong foundational knowledge of and demonstrated proficiency in Linux/Unix (Talos).

· At least five years of experience as an SRE or in a similar role, such as DevOps or Software Engineer.

· At least two years of programming experience in a conventional programming language.

· Kubernetes knowledge is required. Experience with bare metal / non-managed Kubernetes would be a plus.

· Experience in Python and other scripting languages.

· Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, Helm, Puppet, or Chef).

· Networking and cloud computing platform experience.

· Proficiency in scripting and programming languages (e.g., Bash, Python, Go, Node, Java, or similar).

· Familiarity with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK Stack, or similar).

· Experience with Grafana Mimir.

· Familiarity with CI/CD tools and SDLC practices.

· You have strong problem-solving skills and excellent communication skills.

· You can work independently as well as collaboratively in a remote team environment.

You are friendly, collaborative, humble, honest, and always s
