Conf42 Incident Management 2023 - Online

Creating On-Call Heroes: Strategies for Effective On-Call Practices


Abstract

Good leadership invests in a dev-centric culture, giving software engineers ownership of their services from inception to full production. Great leadership promotes a positive and effective on-call culture to empower the engineers. This session will provide the information needed to do it successfully.

Summary

  • Einat Mahat is an R&D group leader at Payoneer. She has been an on-call hero since 2012 and a mentor to emerging champions in the field since 2016. She explains how a blend of strategies, on-call practices, and a nurturing culture can mold on-call heroes out of engineers.
  • In the challenging journey of on-call duty, sidekicks are indispensable in navigating the technological chaos. Being on call means standing vigilant, ready to respond to the summons. On-call superheroes go a step further, scouting for potential pitfalls and resolving them before they escalate.
  • Now it's our collective mission to forge our own league of Guardians. Visualize your team as a robust league of heroes. Cultivate practices that uplift your engineers. Forge alliances within your ranks and share your war stories. The adventure to craft a league of extraordinary on-call heroes begins now.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Engineer on-call is a common practice. Each company takes it in a different direction, but the idea is similar across all time and space. We want to empower engineers to become fearless guardians of their domain, and with great power comes great responsibility. And we do want them to be accountable and take full responsibility. The on-call period might be daunting for many, but today we'll unravel how a blend of strategies, on-call practices, and a nurturing culture can mold on-call heroes out of engineers.

Hi, I'm Einat Mahat, an R&D group leader at Payoneer, embracing roles as a problem solver for over two decades and a community leader at boat. I'm here to provide you with a panoramic view of the on-call landscape, with a rich history as an on-call hero since 2012 and a mentor to emerging champions in the field since 2016. Leading by example is a principle I stand by as an engineering leader, always ensuring I'm present and supportive for my on-call heroes during and after incidents. Reflecting on my own experiences as a once-novice on-call hero, I strive to smooth the journey for new engineers embarking on their own heroic path.

Should you choose to accept your mission, you will walk the path of engineers who safeguard the system. Our exploration begins with the routine and expectations that form the foundation of effective on-call duties. Moving into the on-call odyssey, we navigate through challenges and experiences from incident alert to resolution, offering a real-world view into the lives of on-call engineers. Lastly, we'll delve into leadership's role and the nurturing of an effective culture, uncovering how these elements fortify an efficient on-call environment. Let's go.

In the challenging journey of on-call duty, sidekicks are indispensable in navigating the technological chaos.
Key players like incident management systems, namely ServiceNow and PagerDuty, along with insightful dashboards and APM tools such as Datadog and New Relic, become our vigilant allies, alerting us at the slightest murmur of trouble. But throughout the battles, our most valuable sidekicks are our teammates, always standing beside us, ensuring that no challenge is faced alone.

The cornerstone of on-call is the routine, usually tailored by team, whether on a sprint, weekly, or other basis, with orchestration often led by a team leader or a designated individual. Tools range from simple documents to sophisticated systems like ServiceNow, PagerDuty, Zenduty, or others, ensuring smooth sailing and enabling meticulous tracking, all aimed at fostering more serene on-call shifts.

The on-call motto is clear: guard the system's health. Being on call means standing vigilant, ready to respond to the summons. Whether it comes from a teammate or an automated alert, you should respond and acknowledge. I want to state the obvious: if it's a person calling you in the middle of the night, be nice. Trust that he's not doing that for fun.

Embarking on the pivotal on-call odyssey, our hero seamlessly transitions between proactive vigilance and reactive action. The cycle commences with deliberate protection, where our engineer monitors dashboards and alerts while examining recent deployments to detect potential incidents. When an issue does emerge, she springs into action, responding promptly. During the resolution phase, our on-call hero navigates through triage, mitigation, and resolution, enlisting the aid of the necessary teams to reestablish system stability. Even when the immediate crisis is mitigated, she rallies the troops to permanently resolve lingering issues. In the aftermath, reflection and learning are crucial. Our hero spearheads discussions and learning moments that shape and enhance future incident management and proactive prevention.
This helps make the system stronger and gets the team ready for future problems.

Heroes respond to the on-call, but on-call superheroes go a step further. They are proactive, scouting for potential pitfalls and resolving them before they escalate. Regular check-ins on system health and a keen eye on recent updates can prevent troubles. When an alert beckons, quick action keeps minor glitches from becoming major crises.

In unpleasant times, whether you catch a whisper of trouble, which is ideal, or see a bat signal blazing in the sky, what's next on the agenda? It's all about triage, mitigation, and holding the investigation of the root cause for later. On-call isn't a lone vigilante game. It's about highlighting issues and ensuring they are managed promptly. Help is around the corner. Remember, our heroes aim to restore peace, but first they need to show up.

Triage tactics: sometimes the issue might be outside your power to fix. Your mission is to gather the basic facts. If the problem is in your wheelhouse, jump into action, triaging and either mitigating or resolving it. Work on the problem until a solution is in sight or until you've handed it over to the right specialist. In this case, make sure to perform a warm handover. Remember, you own the scene: stay on scene and lend your expertise until the incident concludes.

The primary goal when facing an incident is service restoration. Gather the required intel, form a strategy to mitigate or resolve the issue, and rally everyone who can help. Prioritize minimizing the impact on customers before diving into investigation mode. If mitigation isn't in the cards, aim for resolution. Time is of the essence.

After the tempest of an incident has passed, whether through a workaround, a rollback, or a direct fix, our journey continues into a meticulous exploration of its depths. It's time to dissect the event and discover its root cause. This is also not a lone venture. It calls for the assembly of your allies, together in unity.
Delve into the enigma, investigate, and piece together the story behind the chaos. Post-incident reflection is crucial. After the dust has settled, it's important to learn from what happened. Conduct a lessons-learned session, also known as a postmortem, or, in more detail, an incident review that sketches a timeline of events, analyzes the root cause employing strategies like the five whys, and outlines action items with designated owners and ETAs. Draw up tasks for future handling to ensure every lesson morphs into a protection measure.

Post-incident, arm yourself better with prevention protocols. Craft runbooks and documentation based on the newfound knowledge, leading to quicker triage in the future. Where possible, develop tests and automations replicating the incident scenario, and refine metrics and alerts for visibility. Every incident thus morphs into a shield against future troubles.

The roles of leadership in this realm are diverse. Initially, they must ensure the engineering managers schedule the on-call rotations. Following that, it's imperative to provide guidelines and establish clear expectations around the process. Occasionally evaluating and adjusting the process is necessary to ensure it meets its objectives and aligns with the company's context. Moreover, determining the severity and SLAs of incidents is a crucial task. They also formulate an escalation policy that aligns with the severity of incidents.

As an engineering leader stepping into the role of a superhero, your main mission is to protect your team from on-call burnout. Make sure to shield them from too much work and to always respect their time. Keep late-night shifts strictly for real emergencies, knowing that many issues can afford to wait. Balance is essential. Put your team's well-being above constant hard work. Some of my proudest leadership moments come from identifying a situation as noncritical and telling my heroes to stand down. When done right, this not only avoids burnout but also builds trust and results in a team that's more engaged, knowing you'll only call them when it's truly necessary.

The key to unlocking the full potential of our heroes lies in cultivating a positive and supportive on-call culture. There are several dos and don'ts, but we'll focus on the crucial ones.

Avoid the blame game. It's not about pointing fingers at other teams or passing incidents around like a hot potato. In the heat of the moment, it doesn't matter who started it. A blame culture can deter individuals from taking risks or owning up to mistakes.

Stay proactive. Don't just wait for solutions to come to you. Engage actively. Exhibit a sense of urgency. Understand the essence of your call to action and take charge of it.

Communicate effectively. Ensure clear communication with your teammates handling the incident, your manager, and your stakeholders. A well-informed team is a well-coordinated team.

Embrace responsibility. Tackle challenges in a respectful and kind manner, embodying the true hero within you. Navigate through the storm by cooperating with others, working together to restore calm.

In our journey today, we reviewed the on-call routine and expectations, delving into the hero's journey, from vigilant monitoring to prioritizing service restoration during incidents and deep post-incident analysis to shield against future issues. Leadership emerged as vital, curating on-call schedules, clear expectations, and thoughtful policies, always with a keen eye on team well-being and balanced workloads. Our exploration also illuminated the vital on-call culture, championing solution-focused, blame-free environments. Together, let's embody our inner hero, fostering an empowering on-call culture.

Now it's our collective mission to forge our own league of Guardians. Visualize your team as a robust league of heroes, each engineer possessing unique skills, standing resilient against the chaos of incidents.
Your mission: invest, ally, and share. Cultivate practices that uplift your engineers. Forge alliances within your ranks and share your war stories of trials and triumphs. Embark on this quest: build your league, share your sagas, and redefine the future of on-call duty. The adventure to craft a league of extraordinary on-call heroes begins now. Thank you.
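
Editor's sketch: the talk asks leaders to define incident severities and SLAs, to formulate an escalation policy that matches severity, and to keep late-night pages for real emergencies. One possible way to write such a policy down is shown below. This is a minimal illustration, not the speaker's or Payoneer's actual policy. The severity names, acknowledgement windows, role names, and the `who_to_page` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3


@dataclass(frozen=True)
class EscalationRule:
    ack_within: timedelta          # acknowledgement SLA before escalating one level
    page_at_night: bool            # does this severity justify waking someone up?
    escalate_to: Tuple[str, ...]   # who is paged next, in order


# Hypothetical values: every number and role name here is an assumption.
POLICY = {
    Severity.CRITICAL: EscalationRule(
        timedelta(minutes=5), True, ("secondary on-call", "team lead")
    ),
    Severity.MAJOR: EscalationRule(
        timedelta(minutes=30), True, ("secondary on-call",)
    ),
    Severity.MINOR: EscalationRule(timedelta(hours=8), False, ()),
}


def who_to_page(
    severity: Severity, waited: timedelta, is_night: bool
) -> Optional[str]:
    """Return who should be paged now, or None if the issue can wait."""
    rule = POLICY[severity]
    if is_night and not rule.page_at_night:
        return None  # noncritical at night: let the heroes stand down
    chain = ("primary on-call",) + rule.escalate_to
    # Move one level up the chain for each missed acknowledgement window,
    # stopping at the last person in the chain.
    level = min(waited // rule.ack_within, len(chain) - 1)
    return chain[level]
```

With these assumed values, a critical alert left unacknowledged for 7 minutes moves to the secondary on-call, while a minor alert at night pages nobody and waits for the morning.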

Einat Mahat

R&D Group Leader @ Payoneer



