Site Reliability Engineer

Position Summary

We are seeking a Site Reliability Engineer with strong platform development skills and a solid grasp of information security and performance optimization. This role focuses on building scalable, secure, and reliable infrastructure and automating processes wherever possible. Ideal candidates are excellent problem solvers who can multitask, collaborate effectively, and continuously learn and improve. Strong coding skills are essential, as is the ability to provide a DevOps capability model that enables rapid continuous integration and deployment of application changes. You will also oversee and govern all changes across the environment.

Essential Duties and Responsibilities

  1. Automate and Optimize Systems: Develop, maintain, and enhance automated tools and systems to ensure the high availability, performance, and reliability of services.

  2. Collaborative Development: Work closely with development teams to design and implement scalable software solutions.

  3. Problem Resolution: Identify, troubleshoot, and resolve issues related to infrastructure, network, and system performance.

  4. CI/CD Management: Implement and manage continuous integration and deployment pipelines for streamlined software delivery.

  5. Proactive Monitoring: Monitor service metrics and logs to detect patterns and predict potential issues before they occur.

  6. Incident Response: Participate in the on-call rotation, responding promptly to incidents and emergencies.

  7. Post-Incident Analysis: Conduct thorough post-incident reviews to identify root causes and prevent future outages.

  8. Cloud Automation: Utilize cloud services and infrastructure as code (IaC) to automate resource provisioning and management.

  9. Comprehensive Documentation: Develop and maintain detailed documentation for system configurations, mapping, processes, and service records.

  10. Best Practices Advocacy: Promote and apply best practices in system security, reliability, and scalability.

Required Qualifications

  1. Education: Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent work experience).
  2. Experience: Proven experience as a Site Reliability Engineer or in a similar role, typically 3+ years.
  3. Certifications: Relevant certifications such as AWS Certified SysOps Administrator, Google Professional Cloud DevOps Engineer, or Certified Kubernetes Administrator (CKA) are a plus.
  4. Security Knowledge: Solid understanding of cybersecurity best practices and tools.
  5. Scripting Skills: Proficiency in scripting languages such as Bash or PowerShell.
  6. Programming Skills: Strong proficiency in languages such as Python, Go, Java, or Ruby.
  7. System Administration: Deep understanding of Linux/Unix systems.
  8. Cloud Computing: Experience with cloud platforms like AWS, Google Cloud, or Azure.
  9. Infrastructure as Code (IaC): Familiarity with tools like Terraform, Ansible, or CloudFormation.
  10. CI/CD Tools: Proficiency with continuous integration and deployment tools such as Jenkins, GitLab CI, or CircleCI.
  11. Monitoring and Logging: Experience with monitoring tools (Prometheus, Grafana) and logging systems (ELK stack, Splunk).
  12. Containerization and Orchestration: Knowledge of Docker and orchestration platforms like Kubernetes.
  13. Networking: Strong understanding of network protocols, firewalls, VPNs, and load balancing.
  14. Database Management: Experience with SQL and NoSQL databases (MySQL, PostgreSQL, MongoDB).

Soft Skills

  1. Problem-Solving: Strong analytical and troubleshooting skills.
  2. Collaboration: Excellent teamwork and communication skills to work effectively with cross-functional teams.
  3. Adaptability: Ability to manage multiple tasks and projects in a fast-paced environment.
  4. Attention to Detail: Precision in diagnosing and fixing issues.
  5. Continuous Learning: A proactive attitude towards learning new technologies and improving existing skills.