Are you passionate about maintaining robust, high-performing infrastructure? Do you thrive on managing complex network environments and ensuring system reliability?
Join our infrastructure team and help us elevate operational excellence to new heights.
We are seeking a Lead Site Reliability Engineer (SRE) to take ownership of our systems' infrastructure and reliability. In this leadership role, you will manage a team of SREs and ensure our platform remains highly available, scalable, and secure. You will also define the reliability strategy, mentor the team, and collaborate with cross-functional departments to build robust systems.
Your mission will be to:
- Lead, mentor, and grow a team of talented SREs to deliver on key objectives.
- Foster a culture of reliability, collaboration, and continuous improvement within the team.
- Define and track team goals, performance metrics, and deliverables.
- Own the design, implementation, and maintenance of scalable and reliable infrastructure.
- Ensure the uptime, performance, and scalability of all services.
- Manage incident response processes, including root cause analysis and post-mortems.
- Drive automation efforts to reduce manual intervention and improve operational efficiency.
- Work closely with engineering, DevOps, and product teams to identify and resolve infrastructure challenges.
- Partner with stakeholders to define and execute a roadmap for improving system reliability and scalability.
- Define and enforce SLOs, SLAs, and SLIs to measure and maintain reliability standards.
- Ensure compliance with industry standards and best practices for infrastructure security.
- Implement and enforce policies for monitoring, logging, and access control.
Requirements
Background and experience
- Fluency in English (French is a plus).
- Extensive hands-on experience with cloud platforms such as AWS, GCP, or Azure.
- Proven leadership experience managing SRE or DevOps teams.
- Strong expertise in infrastructure-as-code tools (Terraform) and container orchestration (Kubernetes).
- Experience with CI/CD pipelines and automation tools.
- Deep understanding of monitoring and observability tools (e.g., Prometheus, Grafana, OpenTelemetry).
- Demonstrated ability to handle high-pressure situations, such as major incidents or critical outages.
- Familiarity with security best practices and compliance frameworks.
- Strong background in implementing and managing incident response strategies and disaster recovery plans.
Join our team and contribute to a resilient and cutting-edge trading infrastructure that supports Flowdesk's growth and innovation in the crypto markets!
Benefits
> International environment (English is the main language)
> 50% of transportation costs covered & a sustainable mobility agreement
> Swile lunch voucher (€9.25 per day, 60% covered)
> 100% Alan Blue covered for you and your children
> Gymlib contribution to gym membership
> Top-of-the-range equipment: MacBook, iPhone, keyboard, laptop stand, 4K monitor & headphones
> Team events and offsites
> Coming soon: international mobility & lots of other cool benefits!
Recruitment Process
Are you interested in this job but feel you haven't ticked all the boxes? Don't hesitate to apply and tell us in the cover letter section why we should meet!
Here's what you can expect if you apply:
- HR call (30') with Talent Acquisition
- Technical meeting (60') with the Head of Infrastructure and Data
- Technical interview (45') with the Lead of Infrastructure
- Video calls with your future team members
- Culture Interview with HR (45')
On the agenda: discussions rather than trick questions! These conversations will help you understand how Flowdesk works and what its values are. They are also (and above all) an opportunity for you to present your career path and your expectations for your next job!