Site Reliability Engineer II - CTJ - Top Secret

Microsoft
United States, Washington, Redmond
Aug 22, 2025
Overview

Do you have a passion for high-scale services and working with some of Microsoft's most critical customers? We're looking for a Software Engineer with the right mix of software development, online services experience, and passion for quality to envision, design, and deliver Office 365 government cloud service offerings.

Office 365 is at the center of Microsoft's cloud first, devices first strategy, as it brings together cloud versions of our most trusted communication and collaboration products, like Exchange, SharePoint, and Teams, with our cross-platform desktop suites and mobile apps. The Office 365 Enterprise Cloud team works with Microsoft's largest enterprise and government customers to deliver features that meet their specific needs and enable cloud adoption. As one would expect, our customers have the highest expectations for feature quality, security, reliability, availability, and performance.

The engineering team provides leadership, direction, and accountability for application architecture, system design, and end-to-end implementation. As a Software Engineer - CTJ - Poly, you will identify and deliver software improvements using your expertise in software development, complexity analysis, and scalable system design. Collaboration skills will be required to work closely with other engineering teams to ensure services and systems are highly stable and performant, meeting the expectations of our government customers and users.
Responsibilities

Contributions to Development and Design
Independently creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability, and operability of one or more platforms, systems, or products operating at scale.
Leverages technical expertise in cloud technologies and specific products, as well as objective insights drawn from analyses of production telemetry data, to suggest changes or add-ons to product features or automation that improve the availability, security, quality, observability, reliability, efficiency, and performance of product components or features supported by their team.
Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles. Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products.

Driving Operational Excellence
Leverages technical expertise and telemetry analysis alongside advanced AI and ML algorithms across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.
Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams.
Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands, and analyzes telemetry data using existing capacity planning models. Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across a limited range of use conditions and system parameters.
Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging AI and ML capabilities. Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams.
Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required. Models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions. Proposes changes and drives implementation of solutions to identified performance and resource challenges.
Identifies opportunities to leverage existing tools and automation, including the safe deployment process (SDP), to enable product engineering teams to increase the velocity at which they can reliably and safely implement changes in production. Monitors the effects of changes across multiple components or features within a single platform or system.
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s). Notifies product teams and owners of major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Communicates details and resolutions through post-mortem reports and review meetings.
Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations. Monitors the impact of changes on operations metrics (e.g., Time-to-X).

Technical Knowledge and Domain-Specific Expertise
Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures. Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the security, quality, reliability, and operability of supported products with minimal guidance from other engineers.
Researches and maintains an awareness of industry trends, advances in cloud technologies, new tools, and/or processes for maintaining and improving product availability, security, quality, observability, reliability, efficiency, and/or performance. Contributes to the implementation of new solutions within their team by identifying ways they can be applied to solve persistent problems.
Develops technical expertise in the code, features, and operations of specific products as required to identify opportunities to improve product availability, security, quality, observability, reliability, efficiency, and/or performance. Actively participates in on-boarding, code/design reviews, and regular meetings with engineering teams that develop and/or manage those products.

Additional Responsibilities
Design, develop, and deliver the required software engineering features and services to serve and protect O365 government clouds.
Proactively identify and reduce issues through design, testing, and implementation of software-based solutions.
Collaborate with Engineering and Program Management partners to translate customer, business, and technical requirements into architectural designs and feature releases.
Drive efficiencies through software improvement and root cause analysis, resulting in improved service delivery, maturity, and scalability.
Work within a highly skilled team of engineers to deliver revolutionary improvements to the cloud and scale them.