Conf42: online tech events

0:09 Miko Pawlikowski

Hello and welcome to another episode of Conf42Cast, the tech podcast from a neighboring galaxy. My name is Miko Pawlikowski and I'll be your host. Today with me I've got Kurt Andersen, the SRE architect at Blameless. How're you doing, Kurt?

0:23 Kurt Andersen

I'm doing great, Miko. Thank you for having me.

0:26 Miko Pawlikowski

It's my pleasure. So, a lot of exciting ground to cover today. But before we even go there, what's the best thing about space travel?

0:34 Kurt Andersen

As I was thinking about it, I think it's got to be the view. Once you get above the atmosphere, you can either look back and see those amazing views of Earth, or you can look out and really see the stars with no interference.

0:49 Miko Pawlikowski

No light pollution at that stage.

0:52 Kurt Andersen

Exactly.

0:53 Miko Pawlikowski

Okay, so you sound like one of those people who would be keen to pay Jeff Bezos for a little trip. Is that right?

1:00 Kurt Andersen

Well, I don't know if I want to pay his price.

1:04 Miko Pawlikowski

Just a product. That's a very good answer. So I was very excited when I was kind of scanning your background, because there's a lot to cover. And it fits very nicely with our space-themed everything at Conf42. So I wanted to ask if you would like to give a quick introduction about yourself. And I would really like if the keywords like NASA, Jet Propulsion and IBM in the 90's came into play in that intro.

1:36 Kurt Andersen

So I started my career working at NASA's Jet Propulsion Laboratory. I had the great experience of working in the image processing lab there and working on several missions. Some of the last Voyager images were coming through at the time I was there, but the primary mission that I was involved in supporting was the Magellan radar mapping of Venus. Because as you may know, Venus is covered with impenetrable clouds, so you can't actually see the surface. And the Magellan mission used radar, a synthetic aperture radar, to look at the surface in greater detail than had ever been done before.

I moved from there, I actually went to Hewlett Packard and got experience supporting a manufacturing line with high volume production of inkjet pens. So that was a very interesting change in perspective. And then I went and worked for IBM's global consulting arm. They were divided at the time, they couldn't do managed services under the consent decree. So I worked at McDonnell Douglas supporting the infrastructure there, and got to see a bunch of really fast planes up close, but lots of desks also. And then from there, I went back to Hewlett Packard, to the original test and measurement side. And I really got involved in supporting engineers with a reliable platform that they could do their engineering work on, which is really the foundation of Site Reliability: giving a reliable platform to people to do whatever it is that the system is supposed to accomplish.

And then I moved over to their managed services arm and helped to build out and operate a major North American cable operator's new email platform, and did the largest known migration of email accounts at the time. And then I was hired by LinkedIn and got a whole new perspective on scale as we grew from about 200 million users, when I started there, to around 700 million when I left. So that was a great experience at LinkedIn. That's where I actually got the first title of SRE, although I'd been doing a lot of Site Reliability work before that. And now I'm working at Blameless as an SRE architect to help build our product to help support SREs and everyone else who wants to deliver a reliable experience on their product.

4:26 Miko Pawlikowski

That is one hell of a resume. I was actually at the Intrepid Museum in New York a couple of weeks ago, just staring at the space shuttle and thinking, 'Gee, this is so cool'. This is just like a dream job to be working on that kind of thing. But I think I'm detecting one recurring theme in all of that, which was the reliability aspect and scaling it up, and you got all the right names on your resume: NASA to IBM to LinkedIn to Blameless now. So I was wondering, you know, you mentioned LinkedIn was the first time when you actually got an SRE title. So could you talk to me a little bit about, from your perspective, over the last few decades, how this evolved into what we call today SRE, and maybe even how you define SRE in 2022?

5:26 Kurt Andersen

As you may know, the moniker of SRE was coined by Google in the early 2000s. I wrote a small report from O'Reilly called "What is SRE", intended to kind of fill in the gaps for management, people who wanted to know more about this profession. And SRE was kept as a relatively closed in-house secret at Google. But of course, Silicon Valley leaks people between different companies. So other companies in the Valley started doing this thing, because they saw that it was part of Google's secret sauce, if you like, and how they managed to make something that worked at scale and met people's needs. SRE became a little bit more official with, of course, the Google book, the SRE book, coming out in 2016, I believe it was, maybe 2015.

6:27 Miko Pawlikowski

Something like that.

6:29 Kurt Andersen

Yeah, Site Reliability Engineering, and then followed by the workbook. Fortunately, I've had the opportunity to work with a lot of folks at Google and formerly at Google. And they've recognized that raising the level of performance and quality across the industry helps everybody, including themselves. And so it's not something that they should hoard to themselves. They've become much more open, I'll say, over the last five years in terms of sharing content, and you can look at sre.google, I believe, is their link. And they've got all kinds of resources out there now, from some of the workshops that they've taught. They built them internally, and then they've taught them at SREcon conferences and other venues. So people can go there and learn some of the skills and learn some of these patterns of systems thinking that were not previously easily taught or learned. And they're not taught in schools, particularly.

7:33 Miko Pawlikowski

So before, you know, we all learned from Google to call it SRE, what were you calling that job at the previous places?

7:42 Kurt Andersen

Well, you know, it didn't really have a recognised title. It might have been a systems engineer, it could have been a sysadmin, it could have been just a developer who happened to focus a little bit more on how the systems performed in production, as opposed to just what features can I ship. And there's a convergence, in a sense, with the DevOps movement that was launched, if you like, around 2010, in having both sides work together and recognizing that, 'Hey, if there's a hole in the boat, it doesn't matter whose side of the boat it's on, the boat still sinks'. So we have to work together in order to bring everybody together and get a successful product. I guess online delivery of software, and the cloud to some extent, but just online software on the web, really changed things, compared to putting it in a box and setting it on a shelf somewhere.

8:50 Miko Pawlikowski

Or sending to space, right?

8:53 Kurt Andersen

Or sending it into space. That's right. Yeah. And of course, when you're sending software into space, you really want to make sure you get it right, because you often don't get a chance to do it a second time.

9:03 Miko Pawlikowski

That's right. So one of the questions I was going to ask you about was, we all speak about reliability. But obviously it's a different thing when you can roll out a new version in 15 minutes and fix an outage that affects people not being able to show off their new job. It's a very different kind of experience when you have satellites and rockets exploding and people dying, potentially. So I was wondering, you know, how different was it thinking about reliability at the different points of your career, and whether there were things that were the same? And what was the most different?

9:41 Kurt Andersen

I would say the timescale is really one of the biggest differences, because when we were working on missions at JPL, we were planning for things that were going to be 5, 10, 20 years in the future. And we were going through doing risk analysis prospectively, what is called today a pre-mortem, for example. But we were considering things on a huge timescale. It's like, okay, in 10 years, what radio reception capabilities can we count on to be available in order to listen to this thing that we have sent out into space? And can we count on 24-hour coverage? How do we minimize any data loss that might occur on these, I'll say, amazingly low bandwidth connections, essentially, that we have from the spacecraft? And today, when you think about, as you say, you roll out a new version, you fix it, you commit it, you ship it into production, and it's 5 or 15 minutes, you don't have that level of planning, you probably don't necessarily need that level of planning. But you still need the ability to think ahead. And so the timescale is the biggest difference.

11:07 Miko Pawlikowski

A little bit of a side question. But I'm wondering when you get together with your old friends from NASA, how often do you talk about SpaceX and what they're doing? I mean, you mentioned you're on the Starlink link, which seems to be working reasonably well. What's the opinion because I know initially, it wasn't necessarily very well-received. But at this stage, from like, someone completely outside, it appears to be more or less accepted. What's the, you know, the prevailing feeling about that?

11:39 Kurt Andersen

Well, I actually haven't bumped into any of my colleagues from that long ago recently. But I do know that there's some concern in the astronomical community as to pollution in their fields of view from Earth, with the proliferation of satellite constellations around the world. I'd say, since I live in a rural area and had middling performance from a broadband connection until I switched over to Starlink, it's been a huge improvement on my end. But that's, of course, looking at it from a selfish perspective.

12:24 Miko Pawlikowski

That's right. Yeah. I mean, we're experiencing for the first time this kind of mass satellite installation in orbit, and I'm pretty sure we'll find out interesting things about the side effects of that at some point, sooner or later.

12:41 Kurt Andersen

Undoubtedly, and, of course, there were previous attempts at similar things like the Iridium systems for satphones, which ended up not panning out financially. So it'll be interesting to see over time how Starlink works out financially and technically.

13:01 Miko Pawlikowski

One question that I'm very, very curious about is, from your perspective, where do you see SRE evolving and going in the medium to longer term? And for everybody else who hasn't had the chance to speak to Kurt, he forgot to actually mention his involvement in USENIX and SREcon, where he has been a committee co-chair a number of times, right? So he's sitting and has a very nice view of where things are going. So if you were to try to distill what your best guess is of where that might be going next, what would that be?

13:44 Kurt Andersen

Well, I think that in some ways, SRE, as a practice, will become a little bit more diffused into the gestalt of software. I'm not gonna limit it to development, but just overall software. I think that some of the tooling is becoming much more accessible. You get a bunch of stuff today, out of the box, if you go and use any of the major cloud providers, which five years ago, or more, would have had to have been built at considerable expense and expertise in-house. So by making this tooling available for everyone, and especially I'm thinking of platforms, observability platforms, they really change the capability for the, I don't know, I hate to use the word 'average' developer. For every developer, I'll put it that way, to get in and see how their software performs in production. Now, there's still a mindset that they have to care about that. But at least they have the capability now, with some of these observability platforms, that was always locked away behind special tooling before.

So I think, to a certain degree, SRE will become a niche profession, in some ways. It's kind of like running your own email server, for example. 40 years ago, everybody did it. Well, not everybody, but lots of people ran their own email servers 40, 30 years ago. It was not any big thing. Today, if you talk to somebody and they say they run their own email server, most people would look at them a little sideways and say, 'Well, why would you ever do that?' Now, there is a core of people who are experts and they spend all their time thinking and sleeping and drinking and eating email. And they run things like Yahoo's email platform and Google's email platform and Microsoft's email platform, or some of the smaller ones, like the privacy-oriented ones, Fastmail or Proton Mail, things like that. But it's a vanishingly small number of people doing that, similar to agriculture.

If you look at how many people 100 years ago were involved in agriculture, it was a significant percentage. And now that has shrunk over time to a smaller and smaller number of people who can still generate huge yields out of the land. But it's a very small number of people who are directly involved in agriculture anymore. I think, in some ways, SRE will be a similar thing. There will be a core set of skills in a smaller group of people. But it'll become democratized for more developers to be involved in the more common aspects of reliability.

17:10 Miko Pawlikowski

Right. Yeah, I think that definitely aligns with a lot of the stuff that I hear from other people. But there's also the question of, we kind of alluded to that before, the mindset part of it, right? We've got the tools that keep getting better. But you need that mindset of building things with the reliability in your mind. And I'm wondering if that is not like a bigger hurdle in the long term than just the tooling. What's your opinion?

19:07 Kurt Andersen

We actually straddle that tension, which is an interesting place to be. Our goal is to provide tooling that makes it easier for people to handle incidents, before the incident occurs, during and after an incident, so that you can become more effective at responding to problems. Because problems will always occur. If you have an efficient response to problems, then you can reduce their scope and reduce their duration and make a better experience for your customers. The folks at Adaptive Capacity Labs talk about this above-the-line and below-the-line divide. Above the line is where the people are, and below the line are the computer systems that do their things, and we have to infer about that. Our goal at Blameless is to build tooling for the people side, to help the people be effective and efficient in responding to incidents. Not to do, say, the observability side that would be probing the contents of the computer systems.

20:21 Miko Pawlikowski

So how is that different from the competitors? And, you know, the other tools from which, I guess, you could assemble a similar thing? What's the unique selling point for Blameless?

20:32 Kurt Andersen

From our point of view, providing this perspective across the lifespan of an incident, before, during and after, is where we think the right answer is. And so in terms of before an incident occurs, we have support for SLOs and error budgets. During an incident, we've got the Incident Management Framework and support for chat bots that will collect all the events that are going on for you. And then after, we facilitate a retrospective or a post-mortem, or an incident review, or whatever you care to call it, in order to learn from what went on before and during the incident. So that trans-incident perspective is our unique offering.
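
For readers new to the terms, the arithmetic behind an SLO and its error budget can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a simple time-based availability SLO, and the function names and the 99.9% over 30 days figures are hypothetical, not taken from the Blameless product.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of unavailability allowed in the window for a given SLO target.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining_fraction(
    slo_target: float, window_days: int, downtime_minutes: float
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # After about 10.8 minutes of downtime, roughly three quarters remains.
    print(round(budget_remaining_fraction(0.999, 30, 10.8), 2))
```

The point of the budget is that it turns a reliability target into a quantity a team can spend: while budget remains, shipping changes is reasonable; once it is spent, the focus shifts to stability.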

22:02 Miko Pawlikowski

That's true. Well, it can't work without Slack, apparently, anymore. That's the rules. So for everybody who wants to get started with Blameless, where do they go? What do they do? What's the least-resistance path to give it a try?

22:19 Kurt Andersen

The best way to go is to go to our blameless.com website. And we've got all kinds of materials there that you can read about SRE, you can read about Incident Management. And we've got several ebooks, that I'll provide links for you to put in the show notes. And there's a give-it-a-try button, I believe. It's at the top of the website, and you can click that. And one of our awesome folks will get back to you immediately, or really soon. And work with you to set something up for you to give it a spin.

22:54 Miko Pawlikowski

Sounds great, for everybody who needs to pause the show and go check it out. When you're back, what I'm going to do is try to extract two nuggets of wisdom from Kurt. And the first question is, what would be your advice to someone early in their software engineering or SRE career for the next big thing? Obviously, it could be a skill, a technology, a language, a methodology, whatever you think is going to become increasingly important and useful in the mid-term perspective. If you had to pick just one thing, what would you recommend?

23:34 Kurt Andersen

I think that I would say the most important thing is to develop the skill of empathy. Being able to put yourself in the shoes of your customer, or the developer. Looking at problems from their point of view, or a variety of points of view, and being able to communicate that, I think that would be the key. That would be the killer app, so to speak, or the killer skill to develop.

24:01 Miko Pawlikowski

Well, that is such a human response. I have not had this one before. Empathy. Yeah, I think we all could use some more of that, on both receiving and giving ends. And that kind of ruins the second question for me because I was going to ask you now, what's the highest return on investment thing that you've done for your own career looking back. Would you say that was the empathy?

24:25 Kurt Andersen

Yeah, it's the empathy. And as far as some specific thing that I did that I found great benefit from, it was probably joining Toastmasters and improving my public speaking skills.

24:39 Miko Pawlikowski

I never actually tried Toastmasters.

24:41 Kurt Andersen

You should give it a try. They've got a pretty solid program. And it's surprising, people think about public speaking as in going to a conference and speaking. But if you're in a meeting with 10 people, that's also public speaking. So it comes in handy in all kinds of scenarios.

24:57 Miko Pawlikowski

That's right. So be empathetic and learn how to speak publicly. The most useful technical skills that we recommend here. Kurt, how do people find you? Do they go to LinkedIn? Do they find you on Twitter? What's the best way to kind of follow your adventures?

25:21 Kurt Andersen

I think probably LinkedIn, I gotta say LinkedIn since I worked there, would be my preference. I'm not hugely active on Twitter. I do have an account there. But LinkedIn is probably the best place to follow. And I also write periodically for our blog at Blameless, so I try to cross-post links there as I come up with them.

25:46 Miko Pawlikowski

Sounds great. And with that, thanks for coming on the show. Hopefully, we'll see you another time.

25:51 Kurt Andersen

Thank you, Miko. It was a lot of fun.

Limitless Reliability