Senior Software Engineer - Chaos Engineering

Microsoft
United States, Washington, Redmond
Oct 17, 2025
Overview

The High Availability (HA) team, part of M365 Core, is seeking a Senior Software Engineer - Chaos Engineering. This role is crucial, as HA has been a cornerstone of the Substrate backend solution. We continue to explore opportunities for improving and optimizing service reliability. Our continuous drive to provide the best service to our customers goes beyond optimizing the storage stack. We work relentlessly on reducing Microsoft's capital and operational expenses as we explore more paths for optimization while maintaining reliable 4.5 9s availability. To achieve that, HA has extended its charter beyond traditional database availability and redundancy solutions toward optimizing power efficiency, platform costs, and networking costs. The latter will be the major focus of the talented engineer who decides to join our team.

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production. As part of the Chaos team in HA, you will work closely with partners (Azure, EXO-Exchange Online, MSR-Microsoft Research) to build the next generation of the Chaos platform for Substrate. The platform will validate the resilience, architecture choices, predictability, and even the monitoring and incident response processes of critical components in M365 distributed systems.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

Own feature projects that directly impact the behavior of the High Availability component of Exchange Online (EXO), which reliably provides 4.5 9s of availability.
Write production, monitoring, and test code; create reports; and conduct performance analysis of the storage engine, database replication, and networking layer.
Research Chaos experiments, identifying opportunities for testing and operational readiness of critical service components.
Engage with EXO, Azure, and MSR partners to build interfaces for a modern Chaos experience and to improve the service resilience, predictability, and observability of M365 distributed systems.
Embody our Culture and Values.