Frictionless Incident Mitigation

0:09 Miko Pawlikowski

Hello and welcome to Conf42Cast, the favorite podcast from another galaxy. My name is Miko Pawlikowski and today my guest is Robert Ross, the CEO at FireHydrant. How're you doing, Robert?

0:22 Robert Ross

I'm doing very well. Good to see you again.

0:24 Miko Pawlikowski

Thanks for coming again. So, you know, traditionally, we're going to start you with a weird question. And the weird question for you today is, if you knew for sure that there were aliens somewhere, would you actually want to meet them?

0:37 Robert Ross

So I've thought about this many times, actually, in conversations with friends. And I think my answer is: no. Because I don't know how I would communicate with them. I don't even know if they're nice, right? If it's just to see what they look like, I don't need to meet them for that. I would just probably go to the news to see that. So no, actually, we're gonna stay distant friends.

1:01 Miko Pawlikowski

That's fair enough, you know, you can just connect on Facebook. Yeah, what if they're not so nice.

1:07 Robert Ross

But what if they don't have faces? And then it's not Facebook anymore, so that might not work, actually.

1:13 Miko Pawlikowski

You really did think about that.

1:15 Robert Ross

Yeah.

1:16 Miko Pawlikowski

So many questions. Okay, so our guest today is somewhat of an incident management expert. So I think it's only fair to start with a question, do you think we have finally figured out how to do the Incident Management right?

1:35 Robert Ross

I think the answer will probably always be: no. I think we're getting a lot better at it. And I think a lot of businesses in the space are realizing that incident management is a core competency that businesses must have for them to be successful in this kind of online world that we're all in. Everyone's working remotely. Everyone is expected to use software to do their jobs now. You know, it used to be folks in banks were using software. But now we're seeing software even in farming, we're seeing software in building construction, and it has definitely eaten the world. And as software gets more and more complex, it solves more and more complex problems. That means that incidents are just going to be always changing, and verifying and managing them will always be changing as well. So I don't think we'll ever have figured it out; there always will be a new challenge with software and the reliability that is required of it now.

2:42 Miko Pawlikowski

So if you plot a trend of you know, how much we have figured out versus complexity, do you think the ratio is increasing? It's getting better? Or we're just kind of trying to catch up with the increasing complexity of everything that we write?

2:57 Robert Ross

Well, that's a good question. I think that software complexity is always increasing; it's been increasing substantially, especially even just in the last five years. And I think that's due in large part to technologies like Kubernetes, and, you know, all of the cloud services that are being released, and new paradigms like serverless. So complexity is increasing very quickly. What I think is increasing even faster, though, is reliability expectations of software. If you kind of look at the space right now, we all treat software like a utility, like electricity, like water. And that means that the tooling and the reliability requirements for these businesses are far outpacing, I think, the complexity now. And if they're not, I mean, at some point, you're going to have a 'come to Jesus' moment where you change direction. Yeah, I think that they're definitely getting closer to each other. And I think there will be an intersection at some point.

3:59 Miko Pawlikowski

Do you think there is a limit though? Like, do you think at some point, we'll be like, 'Okay, that's enough complexity, we should like, stop and start doing things differently,' or is it just like always going to be increasing?

4:17 Robert Ross

Well, I think that there's a XKCD comic that kind of talks about USB, and it says there's this strip where it's like, different variations of connectors. And then we introduced like USB C, there's still plenty of things that are running on USB 2.0 with the rectangle, alright. And yes, like, will there be a software catalyst and change in how we build software? Absolutely. And that's inevitable. But what's also inevitable is that the software that we use today is going to persist as well. Banks are still using COBOL you know, you want job security right now, go learn COBOL because it's not going anywhere, still running financial institutions. I think that the complexity will continue to go higher. And I don't think that we're going to hit this moment in time where we're like, this is too much complexity. We have to change all of our software in hopes of like reducing complexity. Now we just have two different types of complex systems is what I think will happen.

5:21 Miko Pawlikowski

It's fair enough, until quantum computers come in, and they completely shift the paradigm. I was trying to, like, catch up on where we are with that recently. And I realized, I don't know anything about it. It sounds like it could be a real paradigm shift. But right now, it seems a bit exotic and esoteric. I was gonna also ask you, you know, if you were to compare where we are in terms of incident management today and 10 years ago. And 10 years ago being a completely arbitrary number, but you know, long enough that it should have a visible difference. But not so long that you were too young back then.

6:05 Robert Ross

I was writing software longer than 10 years ago. And I do think that incidents have evolved in the management of them. If we think about 10 years ago, let's rewind back a little bit. What was it, 2012. That was kind of the earliest days of the DevOps movement; it was a few years earlier, I think, when some stuff started popping up. But we got faster at deploying software to production, we were deploying on green CI builds. And that was kind of a new thing, back in 2012. And I think that the approach then, when you had an incident, you had a much less complex system; microservices hadn't really, you know, taken off at that point. It was kind of one-process monolithic applications running on either a Rackspace server, or if you were lucky, EC2, or you know, maybe you were running a server in a closet, like I was in 2009, at the first job that I had. And at that point, the response mechanism to an incident was like, restart it sometimes. And we didn't have a lot of the tools for how we manage incidents today; adding more disk capacity 10-12 years ago wasn't easy for most companies. Right now, with a managed database, FireHydrant's managed database scales up for us, we don't do anything. So incident management has certainly evolved, because the tools that surround the systems that break have evolved. So now the incident management process for these businesses is finding the right people to get into the room. Because these distributed systems have different owners across multiple time zones, managing different types of software: one application's Go, one application's Ruby, another one is maybe in Rust, maybe you have serverless things written in JavaScript or TypeScript. So because the complexity of these systems has changed, the incident management aspect of it has to change. 10-12 years ago it was monoliths. I feel pretty confident saying, like, that was the paradigm of choice. And I still think that's a lot of the world today.

8:10 Miko Pawlikowski

And when you said a way of fixing things was to restart it. I actually had a conversation with a friend about this recently. And with Kubernetes, it tends to be very similar. We just say, 'Oh, this is not a pet, it's cattle, so it's okay to restart it'. And we want to automate that.

8:28 Robert Ross

And Kubernetes will do it for you a lot of the time, right? Kubernetes will say, 'Oh, memory exhausted'. And it'll, you know, take that pod out back and give you a new one. We've just made restarting things easier.

8:40 Miko Pawlikowski

Yeah, it looks much better when Kubernetes does it. Okay. So, since we're reminiscing like this, obviously, the key word that comes up more often than it probably should is 'reliability'. And I was wondering, from your point of view, you know, given that you have exposure to people talking about this and dealing with it from all kinds of places, how would you say that the perception of what it means to be reliable has evolved in, like, recent years, or maybe the same 10-year timespan? What would not cut the mustard today as reliable that would have worked perfectly fine in the past?

9:17 Robert Ross

I kind of think, if you look at the software, if you look at Google Chrome, or Firefox or whatever, you know, whatever your browser of choice is, there's a big button on all of them for refresh, and that's still there. And it was there a long time ago. And I still have the muscle memory from when I was on a Windows machine hitting F5. And now I have Command-R to refresh the page. And the reason I'm going down what might seem like a strange storyline here is that people and customers kind of understand that software just might not work the first time around and you have to try again. And we've actually been kind of programmed this way as human beings, and even just in society: what happens when you go up to a door to, let's say, a coffee shop and you push and it doesn't open, what do you do? You pull, you change your strategy, you try a different way, because you still want your coffee. And with reliability, you know, 10 years ago, we had this refresh button because we either wanted to get new data onto the page, or maybe we hit an error, and we were just willing to try again. Now what we're doing is we're making things retry for us much of the time, with distributed systems. Retrying is a real strategy. Sometimes it doesn't work on the first try. And you just kind of try again. What's happening now, though, is as software evolves into this utility, like I mentioned earlier, like electricity in the walls and water from faucets, it will have to not fail more of the time. Reliability, in the perception of customers and people using software, is becoming a no-fail scenario for a lot of different tools. Just last week, Slack had a very interesting incident where the platform was available, but only kind of available, and threads weren't working. And you couldn't mark a thread as read. And I saw tweets, I saw messages, ironically, in Slack. I saw Instagram posts about this, and it drove everyone mad. 
Because this was so ingrained, Slack is so ingrained in many people's daily lives and the way that they work, that when Slack wasn't updating an unread portion of the app, when it was showing that things were still unread when the person had definitely read them, it drove people mad. 10 years ago, that probably wouldn't have been the case; it probably would have been like another thing. It's a little annoying. But that's because software just wasn't so pervasive in our daily work lives; we did a lot of other things back in 2012. If you're a software engineer, what I'm saying right now might not resonate. But we have to understand that most of the world in 2012 was still just kind of getting started on a lot of how we do work in software. So I don't know, I think that reliability expectations have changed, because customer expectations have changed.
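
Editor's note: the "retrying is a real strategy" point above can be sketched in a few lines. This is a minimal illustration, not anything described in the episode: the `retry` helper, its parameters, and the choice of exponential backoff with jitter are all assumptions made for the example.

```python
import random
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure, wait and try again, up to `attempts` times.

    Hypothetical helper: the delay doubles after each failure and gets random
    jitter so that many clients do not all retry at the same instant.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # a real client would catch only retryable errors
            last_error = error
            if attempt < attempts - 1:
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The user-facing equivalent is the refresh button; here the software presses it on the user's behalf, and gives up with the original error once the attempts are exhausted.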

12:22 Miko Pawlikowski

I did notice that my messages weren't being marked as read. It was really weird. It was almost like, you know, a part of me was just unable to communicate. So yeah, definitely can relate to that. I think it's probably a good time to do a little plug now, a little FireHydrant refresher. You know, we've spoken a little bit about reliability and incident management. And obviously the question now is, so what are you guys doing about that?

12:51 Robert Ross

Yeah, well, we're busy. I mean, so for folks listening, I'm CEO of FireHydrant, former on-call engineer for a number of companies. And FireHydrant's unique offering is that we're full lifecycle incident management. And we help teams learn from their incidents, and really convert incidents and the unhappy times of reliability into learnings so that organizations can push all of that back into the beginning of the software development lifecycle. The way that we do that is we integrate very tightly with all of your tools: Slack, JIRA, Zoom, PagerDuty, or Opsgenie or VictorOps (Splunk On-Call now) if you're on those, and we allow people to define what their incident management process should be. When you declare an incident, we will open a Slack channel, a JIRA ticket and create a Zoom bridge for you. And that will record everything happening in the channel. So when you get into your post mortem, or retrospective, whatever you'd like to call it, all of it's available, and you're not spending time copying and pasting from a channel into a Confluence doc or a Google Doc. At the very end, we'll create all the follow-up action items right inside of FireHydrant and into JIRA. And then we allow you to see, well, what SEV1 incidents have outstanding action items from the last three months. And we really are helping organizations manage the ad hoc freak-outs that may come from incidents, and bring consistency to the whole organization. We allow people to define service owners and allow them to manage their own processes for their own services. And ultimately, we're just trying to help organizations achieve a much higher degree of reliability for them and their customers.

14:37 Miko Pawlikowski

So, I think the first question that comes up every time I hear your pitch is, like, how do you convince people that introducing a new dependency, that obviously does a lot of these things that you just described and makes things better when it works, is not going to introduce more trouble? And then maybe if you could talk a little bit about how you ensure that you, being the thing that really has to work when other things stop working during an outage, actually do work? Because that sounds really, really cool.

15:11 Robert Ross

Yeah, I mean, we definitely have to be super reliable at FireHydrant; our expectation is four nines of availability, very high, actually. And as for the way that we talk about the problem of yet another tool, which is fatiguing for organizations: the reality is that we're a single pane of glass into all the other tools. Declaring an incident, and all the other tools that you like to use during incident management, we're the connective tissue to all of them. Someone is creating a Zoom channel and a JIRA ticket and notifying via PagerDuty, as well as creating maybe a document to start tracking things. And that's actually a good amount of work. And while yes, a lot of that work might happen in those tools, the reality is that minutes matter in incident management, and FireHydrant's runbooks, and all of the automation that comes from the service catalog in the product as well, are time saving, which means that you're saving money, and you're saving customer trust that would be lost otherwise. And with customer trust, you gain it in drops, and you lose it in buckets. And if you don't have a way to reduce the pain of incidents for yourself and for your customers, I mean, you just have to have a tool like that. And I think there's a world where we will start to expand the product offering; it's certainly not just a tool that has to integrate with everything else, it has a value prop well beyond that. It just gets even more powerful when you bring in the tools that you're already using and integrate them with FireHydrant.
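
Editor's note: for readers unfamiliar with the "nines" shorthand, four nines means 99.99% availability, which works out to roughly 52.6 minutes of allowed downtime per year. The short sketch below shows the arithmetic; the function name and the non-leap-year assumption are the editor's, not from the conversation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines (4 -> 99.99%)."""
    unavailability = 10 ** -nines  # e.g. 0.0001 for four nines
    return MINUTES_PER_YEAR * unavailability


for n in (2, 3, 4):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

Each extra nine cuts the yearly budget by a factor of ten, which is why the jump from three to four nines is so demanding in practice.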

16:47 Miko Pawlikowski

How long have you been using FireHydrant to debug FireHydrant?

16:51 Robert Ross

Day one, actually. We do use FireHydrant at FireHydrant. We are of the opinion that you should open incidents for the most mundane of things. So if there's a customer escalation for a problem they're encountering, we'll open an incident for that. If we see something that's just a little off, a little hinky, we'll actually open an incident for that too, and we set the severity to investigating. So FireHydrant has had one major incident this year, that was a few minutes long. We have well over 100 incidents opened in FireHydrant from this year. So I think that our philosophy is, if you want to increase your uptime, open more incidents. Because if you're opening incidents for the small things, you're gonna get really good at opening incidents for the big things when it really, really does matter. And yeah, so that's my spiel on how FireHydrant is using FireHydrant. We've written a blog post about that as well. It was on The New Stack, about how opening more incidents of lower severity has improved our reliability.

17:55 Miko Pawlikowski

So a lot of our audience is going to be in some kind of variant of a software engineering/SRE kind of person. And I guess my thinking is that might be a little bit daunting, if they wanted to give it a try. Because, in my head at least, you start getting value when you plug in all these things. How do you recommend getting started with that in your own company, if management or whoever, and the CTO hasn't blessed it yet as a tool of choice?

18:28 Robert Ross

That's a good question. So we do come with a free tier. So you don't have to contact sales. We have an amazing sales team, they're very friendly. But if you are someone that likes to play with something, take it for a test drive on your own, we have that tier available. I've seen people create just a temporary Slack workspace as well. So you don't have to, you know, try to get a blessing from IT, which, you know, in a very large organization can take a couple of weeks; you can install it in that workspace just to say, I want to try to open an incident and use this tool. Also, on getting started: what process do you bring to the table is a question that we hear quite often. It's, I know I need a better incident management process, I don't know where to start. So inside of FireHydrant, there's a starter runbook that follows best practices based on our experience as on-call engineers as well as working with our customers. There's a pretty well-defined set of steps that we see in most runbooks, which we offer as a starter template inside of FireHydrant, too. So that's been a really great way to start. Then at that point, talk to sales or sign up for the essentials tier, swipe a credit card and get started.

19:39 Miko Pawlikowski

What about the future? What's coming? Can you share any sneak peek of any upcoming features or in general, maybe any upcoming trends that you expect to be important going forward in the next, say, 12 months?

19:55 Robert Ross

Yeah, when you started that question, I was like, is he asking me what's happening in the future? I was like, for FireHydrant, or for the world? I mean, jokes aside, at FireHydrant we have a lot of stuff planned in the coming years. But notably, I think that there's a piece of incident management that we've been doing for a long time, and we're just up-leveling it even more right now, which is status pages. For most companies, you might have status.firehydrant.io. And that is the place that customers go to receive updates. StackOverflow hosts their status page on FireHydrant. So if you go to stackstatus.net, that's hosted on FireHydrant. And there are just so many ways to innovate on status pages, actually, and I'm really excited for some of the work that we're doing here. We have something like customer-specific status pages. Imagine being able to give your largest customers their own status page. That's where we're going with the product. And I'm very, very excited about that. Because good communication kind of conquers a lack of trust, and actually can build substantially more trust. Fastly had a major outage and their stock went up the next day, largely because they managed the incident very well and communicated about it exceptionally well. And it built a lot of trust with people instead of losing it. So I'm excited about that. I'm also excited about all of the amazing UX work that's going into FireHydrant. We're a little over four years old, and if you can believe it, we're the oldest kids on the block. We have a lot of incredible UI/UX enhancements coming to the tool, many of which are in the command center right now. The post mortems tool is getting refreshed, or rather the retrospectives tool is getting refreshed; old habits die hard around that word. And just in general, the product is becoming easier to use. 
It's getting a lot of enhancements just across the board: in the service catalog, in incident management, in retros, status pages, and runbooks. It's a very large product area. It's a full-blown reliability platform. And it's cool to see it just get new coats of paint all over the place.

22:03 Miko Pawlikowski

Definitely. And what about like the wider community, you know, SRE, whatever that means to whoever's listening to this. What trends are you seeing? And what trends are you expecting to see in the near future?

22:17 Robert Ross

You know, it's a little interesting. Maybe I'm not looking in the right places. And I'm just gonna caveat with that, initially. But I feel like I'm tapped in to most of the sources of where SRE is, and kind of watching the market. I haven't seen any, like, earth-shattering things happen in the SRE world. I think it's a well-understood idea. And I think that the execution is where people are focused right now, like, what does an SRE team at our company mean? I've seen a lot of companies re-label their DevOps teams to SRE teams. I'm starting to see this other trend now. Maybe this is the earth-shattering thing I just forgot to say: SRE is actually quickly becoming, like, platform engineering. Which I don't know how much sense that totally makes; I think SRE still has a place in the world, you should still have people whose job it is to engineer things for the purpose of reliable software. But I have seen companies evolving some of their operations teams, SRE teams, even, into these platform teams. Honeycomb just actually had a great post about this, as well. And I agree with it; I think that teams are going to have to start to build these platforms, these internal platforms, to accelerate the way that they build software, to make it more reliable, and to provide a really kick-ass customer experience. I think the platform engineering play is where the puck is going right now.

23:47 Miko Pawlikowski

Definitely. That always makes me think, though, how much of 'not invented here' is going into these platforms, where people take Kubernetes and they write a layer on top, or they do another way of doing continuous integration or something like that. And how much do you see a healthy balance between, we obviously want something that works well in the context, and also at the same time, we don't want to reinvent the wheel every time. Do you have a position on that? What's the golden ratio?

24:24 Robert Ross

A good platform should make it really easy to build, ship, maintain and keep software reliable. That's what a good platform should do. And it should hide many of the complexities for these applications that you're building. For example, I mean, you should be able to launch an application with a manifest that says, 'Here are the things I need to run. I need a database, I need a key value store', right? Maybe it's Redis. Maybe it's Memcached. I need Postgres. And then basically the platform is like, I got you, I will create those things for you if they don't already exist. And I'll give you the credentials for them; you just build your application and run it, you don't need to think about a lot of these other things. I think that's what a good platform is. Where I think platform engineering will have a very hard foothold, or where it will be really hard to gain ground, is really good documentation. And really good ease of use: developer experience is what is going to kill an internal platform. And I think a lot of businesses will hit some patch of mud as they try to launch these internal platforms. There are a lot of successes of platforms out there. I mean, if you look at Heroku, I think that is definitely a perfect example of what a good platform can be. Although it was relatively simple, it didn't give you a whole lot more for many years, other than a database. And then they added managed Kafka, they added managed Redis, they added all these amazing things. But as companies move more and more towards their own internal platforms, I do think the dominant choice will be something like Kubernetes to build on top of, and I think that it is going to be hard to educate whole swaths of engineering teams on how to best use it. And there's going to be a lot of opinions on how to build and manage them. I think it's a great idea. I think the implementation of it from a software perspective is very achievable. 
I think the rollout to large engineering organizations will be very, very difficult.
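
Editor's note: the manifest idea described above ("here are the things I need", and the platform creates them if they don't already exist and hands back credentials) can be sketched as follows. Everything here is hypothetical: the `Platform` class, the manifest shape, and the fake URLs are illustrative stand-ins, not any real platform's API.

```python
import secrets


class Platform:
    """Hypothetical in-memory stand-in for an internal developer platform."""

    def __init__(self):
        # Maps (app name, dependency kind) to the credentials issued for it.
        self._resources = {}

    def provision(self, manifest):
        """Create each declared dependency if it is missing; return credentials for all."""
        app = manifest["app"]
        credentials = {}
        for kind in manifest["needs"]:
            key = (app, kind)
            if key not in self._resources:
                # First request: "create" the resource and issue credentials.
                self._resources[key] = {
                    "url": f"{kind}://{app}.internal",
                    "password": secrets.token_hex(8),
                }
            # Later requests reuse what already exists.
            credentials[kind] = self._resources[key]
        return credentials


# The application only declares what it needs; it never says how to create it.
manifest = {"app": "billing", "needs": ["postgres", "redis"]}
creds = Platform().provision(manifest)
```

The design choice worth noticing is that `provision` is idempotent: running it twice with the same manifest returns the same resources, so deploys can call it every time without creating duplicates.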

26:28 Miko Pawlikowski

Very good insights. Okay, so for everybody who wants to learn more about FireHydrant, they can go to firehydrant.com. They can check out the blog. What else should they do?

26:39 Robert Ross

You know, the best way to look at FireHydrant is, obviously, go to the website. There's a great four-minute video that one of our team members put together. He takes you through a really cool example of most of the tools; it's about four or five minutes long, it's really short. And you know, create a free account. It's free, you can integrate Slack, you can have, like, up to 10 users on there, you can use part of the service catalog, you can use the runbooks. It's a really great way to start, even if you're a small team and you're looking for a tool to just help quickly manage incidents; the free tier is all for you. If you're a really big organization and want to use FireHydrant, we work with teams of 2500 engineers on the tool; at that point, you know, reach out to sales, they'll give you a kick-ass demo too and tell you what's possible.

27:22 Miko Pawlikowski

No pressure, sales, no pressure. And for everybody who wants to follow you personally, is "The Adventures of Bobby Tables" the best way to follow?

27:31 Robert Ross

"The Adventures of Bobby Tables", yes, my online moniker. You know, twitter.com/bobbytables, if you want to look at some old GitHub repositories, github.com/bobbytables. That's where you'll find me most of the time on the internet.

27:47 Miko Pawlikowski

And before I let you go, the last thing: in one sentence, what's the skill, technology, language, or methodology that you would recommend checking out or investing your time in in 2023? Because hey, 2023 is coming.

28:04 Robert Ross

The best software will heavily leverage the open-closed principle of the SOLID principles. I will die on that hill.
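
Editor's note: the open-closed principle says code should be open for extension but closed for modification. As a rough illustration in an incident-management flavor, here is a minimal sketch; the class and function names are invented for the example and are not FireHydrant's implementation.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class IncidentStep(ABC):
    """Extension point: each new integration is a new subclass."""

    @abstractmethod
    def run(self, title: str) -> str:
        ...


class OpenSlackChannel(IncidentStep):
    def run(self, title: str) -> str:
        return f"slack channel opened for: {title}"


class CreateTicket(IncidentStep):
    def run(self, title: str) -> str:
        return f"ticket created for: {title}"


def declare_incident(title: str, steps: Sequence[IncidentStep]) -> List[str]:
    """Closed for modification: this function does not change when a step type is added."""
    return [step.run(title) for step in steps]


results = declare_incident("checkout errors", [OpenSlackChannel(), CreateTicket()])
```

Supporting a new tool means writing one more `IncidentStep` subclass and passing it in; `declare_incident` and the existing steps stay untouched.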

28:17 Miko Pawlikowski

That sounds like a Zen sentence to ponder. And we'll leave that as an exercise to the audience. Thank you very much. Once again, Robert "Bobby Tables" Ross, the CEO at FireHydrant. It was amazing to have you here again. Thank you so much.

28:31 Robert Ross

Thanks for having me. Cheers.
