Conf42 Cloud Native 2023 - Online

Next-generation Enterprise Workflows: Unlock Observability at Scale

Abstract

For the longest time, observability & APM integrations to enterprise workflow tools like ServiceNow have been unimaginative and lackluster: just toss a webhook over the wall and let them figure out what to do with that incident or alert. Join this talk to hear and see some real examples of what is possible with a rich, bidirectional integration between observability tools and enterprise workflow tools to enable previously unachievable outcomes with a few clicks and minimal setup.

Summary

  • This talk focuses on unlocking observability at scale, especially in enterprises, by reimagining what's possible when integrating ITSM tools with observability tools. Whether you work in dev, ops, SRE, ITSM leadership, or some other functional area, you might find some value in this talk.
  • Dev and ops used to be separated by a wall: once the product was developed, it would be tossed over the wall for ops to run. Service maps have become more challenging than ever, and modeling them using traditional systems of mapping is suboptimal. Integrations between observability and ITSM tools are consistently some of the toughest ones to get the desired outcomes from.
  • I made an integration between Honeycomb and ServiceNow. It lets me route alerts directly to ServiceNow, import all of the entities that are being observed in Honeycomb into CMDB, and map them into services. It also enables workflows that help with troubleshooting and observing changes.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to my talk about next-generation enterprise observability workflows. I'm Matt Morris. This talk will focus on unlocking observability at scale, especially in enterprises, by reimagining what's possible when integrating ITSM tools with observability tools. If you just rolled your eyes when I said ITSM, two things. One, I get it. Two, stay with me. You just might be surprised. So after a brief backstory, I'm going to propose a new framework for these integrations and what we should expect of them, with real examples. Whether you work in dev, ops, SRE, ITSM leadership, or some other functional area, I think you might find some value in this talk. So let's jump in.
So why are we talking about this? I think it's important for us to take a step back and think about some of the things that led us here between the time when ITSM was really in its heyday and now. First, dev and ops used to exist separately, with a wall in between: once the product was developed, it would be tossed over the wall for ops to run. This led to a situation where there was a lot of disconnect and, honestly, some tough outcomes due to the fact that they were separated. This led, as we know, to the DevOps revolution, which combined these two functions and allowed us to achieve much better outcomes by having complete ownership end to end. And SRE emerged as a practical way of achieving DevOps outcomes, with recommended practices and frameworks for how we could do some of the activities that come with this model. But this pulled us away from the traditional approaches to running services, which formed the foundation for a lot of the ITSM methodology at the time.
Next, on-prem workloads were lifted and shifted to the cloud, then re-architected to run on containers, and are now shifting ever more to run on services that are platform as a service or function as a service.
And this means that we went from a monolith architecture to tightly and then more loosely coupled microservices, and maybe next to multi-runtime microservices. Continuing the trend towards smaller modular pieces being managed individually seems to be where we're at. And this led to a lot of interesting outcomes in terms of what used to be normal with ITSM versus what we see in this world.
First of all, change management has evolved a lot, because CI/CD pipelines and DevOps practices mean that change management is often not the toll gate for deploying to prod that it used to be. CMDB, the configuration management database, predicates a lot of its value on having a consistent inventory with rich attributes about all of the items that are a part of delivering services to your end users. Getting an inventory that's up to date for ephemeral resources, especially when they're hosted in the cloud, is near impossible, and consuming extra overhead to do it is very hard to justify. So we need to be looking at ways to leverage the information that we already have.
Next up, service maps have become more challenging than ever, and modeling them using the traditional systems of mapping that are available in these ITSM tools is suboptimal. And visibility suffers because black holes develop in processes. Executives and leadership can't really see the overall picture, and there's a lack of view into performance, user experience, and user happiness overall. These are a lot of the things that the ITSM tool is supposed to be able to deliver. This might seem like bad news, but the reality is we have the ability to deliver a lot of these things with the observability data that we already have. We're just not leveraging it.
So go with me here, and let's think about this from a visual standpoint. On one side, we have monitoring or APM or observability tools, and the evolution that has happened there into the tools that we have today, and they're producing a lot of data right now.
On the other side we have this ITSM tool, like ServiceNow or Salesforce or something like that, and it's producing a lot of data, but it's really from a process standpoint: how do we get the things done that we need to get done? Now, the thing is, in this current environment, because of the challenges that we just talked about, there's this wide chasm that has developed between these two sides, and the data passing between the two sides and the handoffs between them have honestly been really sad. I've spent years and years working with the integrations that exist between these tools and making my own, and I can honestly say that these are consistently some of the toughest integrations to get the outcomes that I want from.
Now, there have been some attempts to unify these two sides and drive some communication across this chasm. One example is just taking a webhook: whenever there's an alert in an observability tool, we toss that thing over to the ITSM side. Now the problem is that this just hearkens back to the old problem that we had with dev and ops. We're just tossing the thing over the wall. There's no richness here, no workflow. There's no automation capability. And some integrations have tried, with varying degrees of consistency, to bring in host data or info about the entities that are being monitored by the tools. But only a very lightweight amount of information has been brought over so far.
And I think if I had to summarize the biggest problem that I've seen with these types of integrations, it is that they're not approached from the perspective of the outcomes that we want to be able to drive by bringing observability data and ITSM processes, automation, and workflow capabilities together. Right? They're not driving it from an outcome standpoint. It tends to be from a perspective of "we need an integration, we have to get this data across," and we kind of check the box, right? I think we need to ask for more.
That's where I'm coming from. And this has led to several subpar outcomes. First of all, we can't maximize the value of the tools that our organization is paying for. There's a lot of value being left on the table on both sides of the equation. Observability data can do a lot more than just toss an alert over the wall or create an incident. ITSM tools can do a lot more than just try to assign the incident and produce some MTTx dashboards.
Troubleshooting is another area where we could do so much better. Observability is about context-rich data. Don't listen to those people who try to tell you it's about three pillars. The webhook plus light incident data approach strips away lots of the most valuable information that is available here, and users are forced to do manual context switching between platforms, trying to figure out what the incident even means, where it came from, and how to fix it, never mind who's affected or how badly. And these are the outcomes that are core to observability.
Automation and continuous optimization are supposed to be core focuses of every discipline that we're talking about here. But again, the lack of tight integrations and thoughtful design for the interplay between these two sides means that many opportunities just fall through the cracks.
And we want to break down silos. This is what drove the DevOps revolution to begin with. And although walls have been broken down between dev and ops, in many cases the ITSM team and their processes kind of remain on an island. Painfully, those processes that are supposed to be protecting quality of service, the company's bottom line, and user experience come to be seen more as bottlenecks, red tape, and low-value activities. And meanwhile, the lack of governance and process visibility with ITSM on the sideline can be a serious risk to the business. So what do we do about this?
I took it as my mission to try to contribute to a world where the combination of observability tools and ITSM tools can be more like the second emoji here instead of the first one. So I'm proposing a new framework that will allow us to get to the outcomes that we want and maximize what this relationship could be between observability tools and ITSM tools. I'm calling it the observable ITSM framework, and this is version one. And I've broken down some components of what I'm including in this framework across a few different areas.
First, in terms of changes or deploys, we should be able to automatically create changes for CI/CD activities and display those flags on the observability side too, so that we have full context on what changes are happening when, and we can use that as very rich intelligence when we're debugging our applications. We need to enable the attaching of an SLO to a change request. If the SLO is burning post change, then back it out. This aligns well with practices that we probably already follow in SRE. We bring ITSM into the fold here, and it's something that can take zero effort from the DevOps side to make possible. And we should be able to open change requests in the observability tool to see change outcomes as experienced by real users. A great example of this: on the ITSM side, somebody's creating a change and they have tagged a service that we're observing in the observability tool. We should be able to open that up and see, as that change is deployed, whether it affects the performance of our service, because changes come from a lot of different sources and in a lot of different packages.
In terms of service components and maps, we need to be able to create records for all entities and SLOs observed by the observability tool. Those should be created in CMDB based on telemetry data, and they should be auto-refreshed to avoid staleness.
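As an illustration of the "if the SLO is burning post change, then back it out" idea, here is a minimal Python sketch. It is not taken from the talk or from any real integration: the `SloSnapshot` shape, the `should_back_out` helper, and the 5-point threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloSnapshot:
    """Error-budget reading for one SLO (hypothetical shape)."""
    name: str
    budget_remaining_pct: float  # percent of error budget left, 0 to 100


def should_back_out(before: SloSnapshot, after: SloSnapshot,
                    max_budget_drop_pct: float = 5.0) -> bool:
    """Return True when the error budget fell by more than the allowed amount."""
    drop = before.budget_remaining_pct - after.budget_remaining_pct
    return drop > max_budget_drop_pct


# Snapshot taken before the change and again after it was deployed.
before = SloSnapshot("cart-latency", 80.0)
after = SloSnapshot("cart-latency", 62.5)
print(should_back_out(before, after))  # True: a 17.5 point drop exceeds 5.0
```

In practice the two snapshots would come from the observability tool around the change's implementation window, and the result would feed whatever back-out step the change process already defines.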
We do this so that we can then make really rich maps out of these entities, and we should map them based on host attributes that are in telemetry data and on parent-child relationships in traces. And all of this should be something we can set up in five minutes or less, with minimal steps.
We need to be able to create rich incidents, directly or via the event management processes, that include full context: the services, entities, and SLOs that are affected, and the ability to pass fields like the responsible team or severity into the incident directly. What we want to do is enable the teams that are creating alerts to add some intelligence into those alert payloads that actually gets processed automatically on the ITSM side. We need to be able to open a detected incident from an observability tool in one click back into that tool for troubleshooting. We shouldn't have to be copying and pasting links or searching around trying to find a certain alert number or incident number in the tool that generated the notification to the ITSM side. And we should be able to use one click to open a user-reported incident in the observability tool as well. Just because a user happened to be the one who created an incident and said "hey, this thing is broken" is no excuse for us to not have a good route for opening that up on the observability side.
If you like the sound of this, then you're going to like this next part. Let's look at some examples of what this can look like in practice. I have an integration that I made for Honeycomb to integrate with ServiceNow and start to achieve some powerful outcomes that aren't possible with any of the observability integrations that exist for other tools today. So first of all, in terms of setup, we clone this repo, we get the update set that's available here, and we bring it into the retrieved update sets in ServiceNow. The setup is simple here, and it's not even a store app. It's even simpler for a store app.
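The idea of mapping services from parent-child relationships in traces can be sketched roughly as follows. This is a simplified Python illustration, not the integration's actual logic: the `Span` shape and the field names `span_id`, `parent_id`, and `service` are assumptions.

```python
from typing import Optional, TypedDict


class Span(TypedDict):
    """One trace span (hypothetical, simplified shape)."""
    span_id: str
    parent_id: Optional[str]
    service: str


def service_edges(spans: list[Span]) -> set[tuple[str, str]]:
    """Derive (caller, callee) service pairs from parent-child span links."""
    by_id = {span["span_id"]: span for span in spans}
    edges: set[tuple[str, str]] = set()
    for span in spans:
        parent = by_id.get(span["parent_id"]) if span["parent_id"] else None
        # Only a hop between two different services becomes a map edge.
        if parent is not None and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges


spans: list[Span] = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "cart"},
    {"span_id": "c", "parent_id": "b", "service": "cart"},   # internal call
    {"span_id": "d", "parent_id": "c", "service": "redis"},
]
print(sorted(service_edges(spans)))
# [('cart', 'redis'), ('frontend', 'cart')]
```

Each resulting pair could then be written as a relationship between two configuration items, which is what lets a map be drawn from telemetry alone.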
So we open up the update set. Now that we've previewed and committed the update set, we just connect a new environment by adding an API key. We'll give it whatever name we want and submit it here. We have the choice of whether we want to populate CMDB from the tool, or if we don't want to, I'll allow that. We'll submit. It's that simple. At that point, it's going to trigger a lot of actions behind the scenes that will allow me to route alerts directly to ServiceNow, import all of the entities that are being observed in Honeycomb into CMDB and map them into services, and achieve several of the other outcomes that I talked about a minute ago in terms of workflows that will help with troubleshooting, observing changes, and so on.
So now, for example, on the Honeycomb side, let's look at an SLO. We can configure a new recipient for burn alerts. This recipient was created automatically and registered in the background by the integration, and we'll send it in as an event. We can see that on the ServiceNow side, we now have a new service that's been created, called microservice demo. We can open it up and we can see the service map. Again, all this is done just by inserting an API key. And this service map is drawn completely from telemetry data that's already available in the observability tool. So we can see all of the services that are up at higher levels, and we can even see down to Kubernetes pods, and where we have data about them, we can see the nodes as well.
So now that we have this service available, we can go ahead and hit it with an SLO burn alert by triggering this SLO. Something else to note here is that we do have the ability in this integration, without any touches from the user, to specify things like the severity or the assignment group that these alerts or incidents should route to. So we'll show an example of that. Here we're specifying this one called cloud operator group.
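To make the routing idea concrete, here is a hedged Python sketch of how routing hints carried in an alert payload might be translated into incident fields. The payload keys, the default values, and the severity-to-impact mapping are all hypothetical and are not the integration's real schema.

```python
# Fallbacks used when the alert author supplied no routing hints (assumed values).
DEFAULTS = {"assignment_group": "Service Desk", "severity": "minor"}

# Assumed mapping from alert severity to a numeric incident impact.
SEVERITY_TO_IMPACT = {"critical": 1, "major": 2, "minor": 3}


def incident_from_alert(alert: dict) -> dict:
    """Build incident fields from an alert payload, keeping any routing hints."""
    severity = alert.get("severity", DEFAULTS["severity"])
    return {
        "short_description": f"SLO burn alert: {alert['slo_name']}",
        "cmdb_ci": alert["service"],
        "assignment_group": alert.get(
            "assignment_group", DEFAULTS["assignment_group"]
        ),
        "impact": SEVERITY_TO_IMPACT.get(severity, 3),
    }


alert = {
    "slo_name": "frontend availability",
    "service": "frontend",
    "severity": "minor",
    "assignment_group": "cloud operator group",
}
print(incident_from_alert(alert)["assignment_group"])  # cloud operator group
```

The point of the design is that the team writing the alert decides the routing once, in the alert itself, so nothing has to be triaged by hand on the ITSM side.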
So what I'm going to do now is change the target for the SLO so that it will trigger. Okay, you can see that it is triggered now. We go over to the service map side, and we'll shortly see the service map lighting up with the impact of the alert that came in. And we can see that the severity on the service map did just change: it received a minor alert against the frontend service. If we want to see what this alert is about, we can open it up here and see the details of the SLO. We can see the full payload. We can see that it was assigned to the assignment group that we wanted it to be. We can see the full payload down here. And most importantly, getting back to our conversation about making things seamless and making troubleshooting easy, we have a button here that says Open Honeycomb. We click launch and it takes us directly to the SLO that is affected by the issue. And this doesn't have to be done through event management. We can do the same thing with incidents: a recipient is automatically created for incident creation, and we can just as easily map this to that recipient. So we have an incident option as well.
Now let's look at an example of a change request. We can see that here we have a change request that is making some changes to our cart service caching. The service it's tagged to is our MS demo service, which is observed by Honeycomb, with our cart service as the main configuration item that's being affected. We've gone through the process of getting our change ready to go, and the next thing that we're going to do is put it into the implement state. Now that it's in the implement state, we have a button here that says Open in Honeycomb, so that we can observe our change in real time as it's being deployed. This pulls us into a query where we can see in real time the count of transactions and the heat map of their duration against this service, from 2 hours before to 2 hours after the change started implementing.
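The "2 hours before and 2 hours after" scoping described here can be sketched in Python as below. This only illustrates the windowing idea: the base URL and the query parameter names are placeholders and do not reflect Honeycomb's actual URL format.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def change_window(change_start: datetime, hours: int = 2) -> tuple[datetime, datetime]:
    """Return the (start, end) window centered on the change start time."""
    delta = timedelta(hours=hours)
    return change_start - delta, change_start + delta


def change_window_link(base_url: str, service: str,
                       change_start: datetime, hours: int = 2) -> str:
    """Build a deep link scoped to one service and the change window."""
    start, end = change_window(change_start, hours)
    params = {
        "service": service,                    # hypothetical parameter name
        "start_time": int(start.timestamp()),  # epoch seconds
        "end_time": int(end.timestamp()),
    }
    return f"{base_url}?{urlencode(params)}"


implement_start = datetime(2023, 3, 1, 12, 0, tzinfo=timezone.utc)
link = change_window_link(
    "https://observability.example.com/query",  # placeholder URL
    "cart",
    implement_start,
)
print(link)
```

A button on the change record would simply open a link like this one, which is what removes the copy-and-paste step between the two tools.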
And we can see that it's even scoped down to the service name, which is cart.
One last example. Let's pretend that I'm a user who has come in and is reporting an incident for an issue that I'm seeing. I could do this from the service catalog or various other record producers on the ServiceNow side, or I could create it here directly. So I'll just pick a caller, which can be me, and I'll give it a description: the MS demo application is slow. And we'll pick our MS demo service. We don't know anything else besides that, right? We'll just say, hey, channel: this is a self-service thing. State: it's new. Okay, fine. And impact: I don't really know what this is all about, but I'm going to set it to a one because the service is really important to me. I'll save this incident now without knowing anything else.
The incident was just created. The only piece of information we know is that it's affecting this service, plus this brief description here. We now have an Open in Honeycomb button that we can use to go in and see what's going on with this particular service, because we have identified that this is something that's being observed in Honeycomb. It could be a Kubernetes pod, a Kubernetes node, or a service that underlies the top-level service. In this case we're looking at the top-level service. So we'll click Open in Honeycomb, and we get back a query that shows results from just a moment ago until now about what is going on with this application. And if we wanted to, we could zoom out even further. We can say, hey, let me see what this looked like over the last 8 hours. What happened leading up to this point? It looks like there was a spike in latency a little earlier in the night. Maybe I want to go back and figure out what's going on with that. So if I'm responding to this incident, I have a very easy one-click option to get into troubleshooting exactly what's going on here. And this is just the tip of the iceberg of what's possible with a really thoughtful integration between observability tools and ITSM tools.
So what's next? I'm planning to come out with a second version of the observable ITSM framework at the end of Q3 this year. Version 2.0 is going to be packed with some very cool features that I have on the roadmap. If you're interested in this journey, adding your voice, building out an integration like this, or just commiserating about the things that we want to be better, then let's talk. Look me up on LinkedIn and let's connect. You can DM me; let me know you watched the talk and two things you liked if you liked the talk, or two things you hated if you hated the talk. Let's keep the conversation flowing, challenge the status quo, and demand more from these integrations.
...

Matt Morris

Principal Solutions Engineer @ RapDev.io

Matt Morris's LinkedIn account


