
Site Reliability Engineer

Description

QGenda is a fast-growing, Atlanta-based healthcare software company with an amazing corporate culture, where we strive to be the best place to be a customer. Thousands of hospital departments around the world use our software to automatically generate optimized physician work schedules that accommodate complex business rules and accurately schedule the appropriate medical provider based on skill level, specialty, availability, and preferences.

As a Site Reliability Engineer, you will work with our product development teams to increase the scalability, reliability, and performance of our systems. You’ll build and extend automation for configuring and monitoring our AWS-hosted applications. You’ll evaluate new AWS services and tools to determine whether they could be used in our environments. You’ll bring a focus on platform health and monitoring so that we can deliver the best possible experience for our customers.

 

Job Duties & Responsibilities

  • Partner with software engineering teams to ensure that scalability and reliability are designed and implemented in new features and products
  • Write automation code for provisioning and operating infrastructure
  • Establish end-to-end monitoring and alerting on all critical aspects of the system to ensure SLAs are met and to provide proactive notification of possible issues across all systems
  • Design platforms for extremely high uptime and ensure that our production SLAs are measured, monitored, and maintained
  • Oversee the infrastructure, including its provisioning, for customer-facing applications hosted in AWS in production and pre-production environments
  • Work closely with dev and ops teams to build highly available, cost-effective systems
  • Stay current with new cloud computing capabilities on Amazon Web Services and look for opportunities to use those capabilities in our products
  • Participate in service capacity planning and demand forecasting, software performance analysis, and system tuning
  • Use extensive metrics to identify issues before they impact our customers
  • Troubleshoot problems across the entire cloud-based stack (network, databases, and application) and build automation to prevent problem recurrence
  • Identify underlying root causes and provide recommendations or solutions for long-term, permanent fixes to critical production issues
  • Develop effective documentation, tooling, and alerts to both identify and address reliability risks
  • Participate in an on-call rotation with other members of the Development Team
  • Promote fundamentals of site reliability across the Product Development department and the organization as a whole

 

Requirements

  • Advanced proficiency with at least one scripting or programming language
  • Hands-on experience building infrastructure and supporting applications in AWS using services such as Lambda, EC2, ECS, S3, SNS, SQS, RDS, Redshift, and ElastiCache
  • Familiarity with configuration management and infrastructure as code (IaC) tools such as Ansible, Terraform, or CloudFormation
  • Solid Windows experience and familiarity with environments using Active Directory
  • Experience using a distributed version control system (Git or Mercurial preferred) for check-ins, branching, merging, pull requests, code reviews, etc.
  • Knowledge of CI/CD best practices and tools such as AWS CodeBuild, Jenkins and TeamCity
  • Experience designing and delivering secure, high-performance, and highly available cloud services
  • Experience working with stakeholders to define and track SLIs, SLOs, and SLAs, using metrics and monitoring to ensure the objectives are met or exceeded
  • Strong understanding of networking and DNS
