SRE Engineering Manager
Engineering Manager, DevOps (SRE)
The Resilience Engineering group is a team of engineers with relationships across the company. Our goal is to provide Infrastructure As Code and develop software capabilities that accelerate product iteration, quality, and security, in addition to owning and supporting data center and cloud operations. As an Engineering Manager, you’ll lead a team of experienced SREs and developers, collaborate with multiple engineering, product, and business teams, build and evolve our operations practices, and deploy data center and cloud infrastructure as needed.
You have technical breadth and depth in automating operations and experience in developing and deploying Infrastructure As Code, along with strong management chops to define and deliver on the roadmap. You are passionate about making other people successful, and you know how to balance near-term operational requirements against roadmap items.
- Building out and automating our world-class infrastructure, both on AWS/GCE and in our data center. We are not fully hybridized yet, but we want to be! Can you help us build it?
- Leading the Resilience team in IDC in developing software capabilities to provide and support Infrastructure As Code.
- Serving as the main point of escalation for this team's operational issues and as its day-to-day operations leader.
- Automating the deployment, scaling, monitoring, alerting, and resilience of our platform with a focus on reducing day-to-day operations workloads and building efficient operations.
- Fostering relationships within engineering to provide service-level performance analysis.
- Performing operational tasks as required and developing toolsets to handle them.
- Mentoring and guiding internal and external team members to become collaborative, cross-functional engineers.
- Having fun while working to nurture healthier, happier employees, both at Castlight and at our customers. We want our employees to work smarter, not harder!
- BS in Computer Science, Engineering, or a related field, with 10+ years of overall experience.
- 5+ years of experience leading an SRE or DevOps team, with solid hands-on development experience balanced with solid experience leading operations teams.
- Ability to develop and support relationships to understand Engineering and Resilience needs.
- Hands-on experience with Chef, Terraform, and Consul, plus knowledge of networking concepts and protocols such as load balancing (ELB), DNS, and TCP/IP.
- Experience with Docker, Kubernetes, Helm, and K8s automation is a strong plus.
- Adept at scripting and ad-hoc automation, especially with Bash, Ruby, Python, and Jenkins.
- Well-versed in Linux administration, with first-hand exposure to containers and orchestration (Docker, Kubernetes).
- A deep background in working with VMware data center infrastructure as well as bare-metal PXE deployments.
- Understanding of monitoring principles and use of alerting tools such as Datadog, Nagios, Prometheus, PagerDuty, etc.
- Strong debugging skills and a willingness to work with and mentor others in investigating operational issues.