Conf42 Chaos Engineering 2021 - Online

Organizational Chaos and recipes for Service Ownership

Abstract

Service Ownership can mean a lot of things in a growing engineering organization. The advent of microservices has made it more critical to get right. In this talk, we’ll talk through all the ways your organization can cause operational chaos before you get Service Ownership correct.

Before you begin failure injection or chaos engineering, it’s important to have clear Service Ownership defined within your organization … else an even greater level of chaos can ensue in your infrastructure and between your teams. In this talk, Joey Parsons from effx will share recipes for defining Service Ownership within your engineering organization and why it’s critically important for chaos engineering and your overall engineering success. With stories from his new startup effx, and previous companies Airbnb, Flipboard, and others, he’ll share strategies and stories from what’s worked well and what hasn’t.

Summary

  • Joey Parsons is a longtime veteran in the infrastructure and reliability space. He's led engineering and operations teams for the last 20 years at a handful of companies. These days he's building a platform to help engineering teams better operate and use microservices. He'll talk about how recipes for service ownership can overcome organizational chaos.
  • Service ownership is more of a mindset of responsibility that gives you freedom as well as autonomy. It means that teams are fully autonomous and responsible for the delivery of their service component or piece of functionality. Service uptime matters, so the on call part of ownership is naturally critical.
  • What happens when service ownership doesn't go right? Organizational chaos. Before you can even think about running failure injection or setting up game days, you need to know who to involve in a world of microservices.
  • No unowned services, ever. Everything needs to have an owner. It's important to define an owner for each service and keep it up to date and maintained. Without maintained service ownership, mishaps such as paging someone for a company they left years ago are bound to happen.
  • A service cannot be owned by two teams. You need a single team to contact. Every engineer on a team is part of the ownership, not just the SREs or the ops folks. You should be tracking production operational readiness constantly.
  • A goal for every reorg should be to assign or reassign every service or piece of software in existence. Always think about how to do service ownership. The folks on your team are the most precious part of your organization and company. Don't forget them.
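
The first two recipes (no unowned services, one service for one team) lend themselves to an automated check. The following is a minimal sketch in Python, not anything shown in the talk: the `Service` record, its field names, and the sample catalog are all hypothetical, and a real organization would load this data from wherever it tracks services.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    # Hypothetical catalog record; field names are illustrative assumptions.
    name: str
    owners: list[str] = field(default_factory=list)  # owning team names
    slack_channel: str = ""
    pager_rotation: str = ""


def ownership_problems(services: list[Service]) -> dict[str, list[str]]:
    """Map each problematic service name to the list of problems found."""
    problems: dict[str, list[str]] = {}
    for svc in services:
        found = []
        if not svc.owners:
            found.append("unowned")
        elif len(svc.owners) > 1:
            found.append("owned by more than one team")
        if not svc.slack_channel:
            found.append("no Slack channel")
        if not svc.pager_rotation:
            found.append("no on-call rotation")
        if found:
            problems[svc.name] = found
    return problems


# Made-up sample data for illustration only.
catalog = [
    Service("checkout", ["payments"], "#payments", "payments-primary"),
    Service("legacy-cron"),
    Service("search-api", ["search", "platform"], "#search", "search-primary"),
]

for name, found in ownership_problems(catalog).items():
    print(f"{name}: {', '.join(found)}")
```

Running a check like this on a schedule, or after every reorg, is one way to notice an unowned or doubly owned service before an incident does.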

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everybody, thanks for coming to my talk about organizational chaos and how recipes for service ownership can overcome those challenges. I'm really excited to share, so let's get started.
So who am I? I'm Joey Parsons. I'm a longtime veteran in the infrastructure and reliability space. I've led engineering and operations teams for the last 20 years at a handful of companies you've probably heard of, running anything from platform and backend engineering teams to old-school operations teams. Fun fact: I've actually carried a literal pager for on call for work. More recently, running site reliability at Airbnb, I've seen how folks organize both their people and their infrastructure for success. These days I'm building a company called effx, where we've built a platform to help engineering teams better operate and use microservices, including helping them overcome the human and cognitive load challenges that can occur, including service ownership. What I'm going to talk about today comes from my years of experience as well as from our customers, who come to us to solve some of these challenges.
So what is service ownership? It's a pretty broad topic, to be honest, and there are plenty of opinions in the space that are all generally correct, and I'm pretty glad that it's actually become a topic that people talk about. Some folks like to distill it down to simply "you code it, you own it." But it goes far beyond that. Ownership is more of a mindset of responsibility that gives you freedom as well as autonomy. It's not all just on call and deploys. So how do we define it? Service ownership means that teams are fully autonomous and responsible for the delivery of their service, component, or piece of functionality. It goes both ways: in order to have true ownership and have the autonomy over your work, you take on responsibility.
And thankfully, the responsibility parts of ownership come with the freedom, flexibility, and autonomy that we know every engineer craves when it comes to their work. And this includes, but is definitely not limited to, incident response. This is what most folks think of when they think about service ownership, but it's really just a part of it. Ultimately, though, it's a pretty big part of the responsibility side of things. Service uptime matters, so the on-call part of ownership is naturally critical. But it's not just on call. It's managing observability and monitoring, and ensuring that everyone has all the information they need to take action when something dire happens. Performance and efficiency is pretty self-explanatory, but as a service owner, you're responsible for making sure that a service runs as fast, or as slow, I guess, as it needs to. Tuning for performance and efficiency of resources falls upon the service owners. Building features and tracking bugs is the most obvious of all these, but it's critical to call out: who do those Jira tickets get assigned to? Capacity planning: costs may not matter in some companies, but in others they're absolutely critical. Keeping track of your resources and using them appropriately is key in any low-margin business. In some orgs, it's simply a measure of dollars, but in others, this is something that can show up and funnel up to a P&L report. Last but not least, failure injection and chaos. Who's running game days for a service? Who's ensuring that it's fault tolerant to the rigors of your platform, compute, and user behavior? More often, it's a critical part of an engineer's workload.
So, speaking of chaos, what happens when service ownership doesn't go right? Organizational chaos. So this is Conf42 Chaos Engineering. Before you can even think about running failure injection or setting up game days, you need to know who to involve in a world of microservices.
Imagine trying to run a game day where you're going to take part of the infra down. Let's say it's a really critical database that a lot of services need to engage with through its presentation service, so this may have repercussions for a lot of services. If you're unable to figure out who to coordinate with for those other services for that game day, you end up getting stuck. Frankly, beyond that, without a sense of service ownership, your engineers are probably struggling on a day-to-day basis to get anything done in their own worlds, much less cross-team or infrastructure-wide. Chaos inevitably ensues, and engineers are left distraught that the people side of things simply isn't set up properly, and they look to management for answers.
So what are all the things that can cause organizational chaos without service ownership? First and foremost, unowned services. This is probably the worst thing that can happen. Poor ownership is replaced by nonexistent ownership, which leaves all the responsibilities of a service owner to nobody, and no one knows who to talk to. Secondly, no easy way to contact a service owner. Knowing who an owner is is one thing, but how do I get in contact with them? What's their Slack channel? What's their email address? How do I page them if there's an emergency? If it's not easy, it's not getting done without engineers being, frankly, absolutely livid. Next, poor documentation or runbooks. Part of ownership is making sure that documentation is up to date and solid, so that not only can your team follow it, but others can too if they ever need to operate your service. Then there are bus factors of one. This usually comes hand in hand with the hero complex. I'm sure we've all had it before, but there's usually that one engineer who knows everything and can operate every system or service. Yet when they leave the team, switch roles, leave the company, or frankly something terrible happens to them, the rest of the team is left holding the bag.
Their knowledge needs to be dispersed to a team or teams, both to prevent their own burnout and to share the load and responsibility. Make this a priority to end if you can. Finally, broken or missing on-call rotations. I bet we all have hilarious on-call rotation stories. I remember getting paged for a startup I hadn't worked for in over two years because something esoteric broke. Yeah, without service ownership maintained, this sort of funny thing is bound to happen.
There's plenty more where all that came from, and I'd love to hear your stories, but I want to run through some tenets, or even recipes, for how to ensure solid service ownership at your company.
First thing: no unowned services. Ever. None. Everything needs to have an owner. How many of you have felt that story where there's this critical service that everybody knows exists, but folks have played hot potato with it for years, to the point where, when there's a security issue, nobody quite knows who to reach out to? That's an unowned service, and probably the worst thing you can have for your culture. It's important to define an owner for each service and keep it up to date and maintained in a place that's easy for folks to consume and digest.
Which brings me to my next point, which may be a little controversial: one service for one team. A service cannot be owned by two teams. When two teams own it, nobody does. You need a single, well-defined team to contact. Imagine being in a situation where something dire is happening, a security issue, incident response, something like that, and you're unsure which team to reach out to. There's a decision point that shouldn't exist. Sure, you could ping both teams, but then you're likely to get someone involved who doesn't need to be. One of the tenets of service ownership is being autonomous, and you can't be if you need to coordinate elsewhere. A one-team-to-one-service ratio.
Never toss over the fence, even within a team.
So the embed model for SREs is gaining popularity and becoming quite prevalent, especially if you're able to hire these sorts of highly trained folks. However, if you're just replicating the pre-DevOps model of separating dev from ops, where they're the only ones dealing with the hard parts of ownership, you're doing it wrong. Every engineer on a team is part of the ownership, not just the SREs or the ops folks. Every engineer needs to be on the on-call rotation, and the SREs should be upleveling everybody's ability to do it better.
You should be tracking production operational readiness constantly and always. Constantly and always. Part of true service ownership is having quality documentation, whether that be runbooks, API specs, general documentation, observability dashboards, and more. Having this information readily available makes ownership easier, not just for you, but for the other folks who may need to operate your system in a pinch. And don't just do it as part of a preflight checklist before a service is launched for the first time. Do it constantly and make sure everything is kept up to date. And when it comes to tracking readiness, measure what matters most to you. Every organization is different: you may care deeply about having in-depth runbooks, while a company under regulatory compliance may care more about something else, like having a security scan run every day. Build a mechanism to measure what matters most to your organization, rather than just taking the cookie-cutter approach from other companies you may have heard from.
Contact information should be easy and simple to find, and if it's not, it needs to be incredibly intuitive. For example, let's say you have a team in your engineering org called user growth. They're responsible for building services and functionality to help grow your user base. Simple enough, but what if their Slack channel is something called #scale and their email address is just growth app?
No one's going to know how to get in contact with them. Keep things simple and make it consistent, or else your engineers are just left floundering.
Speaking of keeping things simple, on-call rotations should be intimately tied to your user directory system, whether that's Google, LDAP, Active Directory, or whatever you use. Otherwise, you end up with too many inconsistencies between the two groups, and even funny things where people have left the team or the company and are still on call for your service.
Let's talk about reorgs, the absolutely dreaded reorg. Nobody loves reorgs, and they're usually done in quite a haste, but often, especially with microservices, the people get reorged but the microservices don't. A goal for every reorg should be to assign or reassign every service or piece of software in existence, so that the negative ramifications for service ownership are mitigated from the start and you don't end up with that one person owning a service or, even worse, unowned services where people just shrug and say, "Hey, I guess I used to own that, but I switched teams, so I don't anymore. Good luck." You never want that to happen. So make sure you reorg all of your services too, anytime you do an organizational reorg.
Ownership is a challenge. It's definitely not easy to get right. There are probably a jillion other recipes I could have mentioned today that would help all of you, but hopefully this is a set of rules that you can live by and it can be a start for you. One last thing to mention at the end here is that the folks on your team are the most precious part of your organization and company. Always think about how to do service ownership and what you can do, or what tools you can implement, to help them do it better. You want them building products and software to help your end users, not thinking about tracking down a person, a service, a dashboard, or an owner and leaving the day frustrated. Think about the owners. Don't forget them. Thanks again for listening.
Feel free to hit me up at @joeyparsons on Twitter to chat. Have a great day.
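
The advice in the talk about tying on-call rotations to the user directory can be sketched as a simple reconciliation. This is a rough illustration under stated assumptions, not tooling from the talk: the function name, the two dictionaries, and the sample usernames are hypothetical, and in practice the data would come from a directory such as LDAP and from a paging tool.

```python
def stale_on_call(rotations: dict[str, list[str]],
                  directory: dict[str, str]) -> dict[str, list[str]]:
    """Find rotation members who left the company or the owning team.

    rotations: team name -> usernames on that team's on-call rotation
    directory: username -> current team, active employees only
    """
    stale = {}
    for team, members in rotations.items():
        # A member is stale if the directory has no record of them
        # (they left the company) or lists them under another team.
        gone = [user for user in members if directory.get(user) != team]
        if gone:
            stale[team] = gone
    return stale


# Made-up sample data for illustration only.
directory = {"ana": "payments", "bo": "search", "cy": "payments"}
rotations = {"payments": ["ana", "bo", "dee"], "search": ["bo"]}

for team, users in stale_on_call(rotations, directory).items():
    print(f"{team}: remove {', '.join(users)} from the rotation")
```

Here `bo` has moved to another team and `dee` no longer appears in the directory, so both are flagged on the payments rotation. Running the same comparison right after a reorg would surface rotations that did not move along with the people.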
...

Joey Parsons

CEO @ effx.com



