Site Reliability Engineer II

Microsoft
United States, Washington, Redmond
Apr 05, 2025
Overview
Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky's-the-limit thinking in a cloud-enabled world.
Microsoft's Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.
Within Azure Data, the Microsoft Fabric platform team builds and maintains the operating system and provides customers a unified data stack to run an entire data estate. The platform provides a unified experience and unified governance, and enables a unified business model and a unified architecture.
This team (SRE) ensures the reliability, scalability, and performance of systems and services. By integrating software engineering with IT operations, the team automates processes, manages incidents, and enhances system resilience. Acting as a bridge between development and operations, SREs help organizations maintain highly reliable and efficient systems while enabling fast and seamless software delivery.
We do not just value differences or different perspectives. We seek them out and invite them in so we can tap into the collective power of everyone in the company. As a result, our customers are better served.
Responsibilities
* Work with all aspects of a high-throughput, multi-tenant service.
* Collaborate effectively within the team and with partner teams across Microsoft.
* Be part of the on-call rotation for maintaining service health.
* Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams.
* Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement.
* Document and define existing data engineering processes, data, and technology, while evaluating them for optimization.

The core responsibilities include:
* System Reliability & Uptime - Ensuring high availability of services.
* Incident Management - Detecting, responding to, and mitigating system failures.
* Performance Monitoring - Tracking system health and resolving bottlenecks.
* Automation & Tooling - Reducing manual work through scripts and automation.
* Capacity Planning - Scaling infrastructure efficiently to handle demand.
* Postmortems & Continuous Improvement - Analyzing failures to prevent recurrence.

* Embody our culture and values.