Conf42 Site Reliability Engineering 2022 - Online

Future of observability in an experience-driven economy

Abstract

AI has started to scratch the surface when it comes to deciding the future of observability. But it’s a long, strenuous journey ahead, as the technology is handicapped without the much-needed guidance of DevOps and SRE (observability) folks.

In the next few years, the future of observability will be determined by shrewd IT teams prepping their systems and software by collecting, streamlining, and optimising the data (metrics, logs and traces) for the much-hyped AI-driven world. Troubleshooting and securing applications will get tougher as applications grow heavy and complex with changing business needs, and as the experience economy kicks in.

What does this mean for DevOps and SRE teams? This talk will discuss what vendors and IT teams ought to do to prepare for an experience economy.

Summary

  • This Conf42 SRE session focuses on the future of observability in an experience-driven economy. What are DevOps and SRE, how do they differ, and what does observability have to do with them? And how can tools help in achieving observability?
  • Sustained success lies with successful users. If you want successful users, user experience is important. User experience is a major differentiating factor if you want to stay ahead of your competition. 89% of companies have adopted a digital-first strategy, and 86% believe cloud technology is critical for digital transformation.
  • What is observability, and how is it different from monitoring? Observability is the underlying platform; on top of it you can do monitoring, and then you have to do analysis on the data you collect. Analysis is important.
  • DevOps is transforming into DevSecOps, where security at all layers is important. With digital adoption, we are moving to an era of deploying multiple builds within the same day. DevOps people need to make use of tools to solve problems easily.
  • If DevOps is about the principles of what is to be done, SRE is about how you do things. SREs are more ops-oriented and lean more towards the production environment. They have to do with the reliability of systems, particularly in the cloud. End-to-end visibility of all the layers is important.
  • The three pillars of observability are metrics, traces and logs. The metrics that you monitor will vary depending on the component that you are monitoring. Availability, security, performance, scalability and cost metrics are key to achieving observability.
  • Traces have to do with end-to-end visibility, or pinpointing the line of code that is having issues. The industry is moving from a monolithic architecture to a microservice architecture, where one problem can have a cascading effect on the entire application.
  • Logs. Why have logs become a pillar? Whenever we have a problem, all of us go and look at the logs to find out what the issues are. AIOps helps SREs, for example through chatbot integrations. We are moving towards being proactive rather than reactive.
  • Site24x7 is an AI-powered full-stack monitoring platform. We shape our tools and they in turn shape us. It's important to choose the right set of tools depending on what your business needs are. The speaker offers to arrange a one-on-one session for anyone who needs a demo of the product.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, I'm Rajalakshmi from Site24x7, happy to virtually meet you all at this event, the Conf42 SRE conference. Thank you for joining my session on the topic "Future of observability in an experience-driven economy". The agenda that we'll be discussing today is about digital experience and the importance of it, and observability versus monitoring: what are the differences? What are DevOps and SRE, and what does observability have to do with them? What are the pillars of observability, and how can you achieve them? And how can tools help in achieving observability? So first, when we talk about experience, particularly the digital experience, the first thing that comes to my mind is this book by Kathy Sierra, by the name Badass: Making Users Awesome. I'm sure some of you would have read this book; if not, I would strongly recommend you read it, because she gives a different perspective on how you have to keep your users at the center of your business, and this opens our eyes. There is a question in this book, and it goes like this: which would you rather have a user feel? She goes on to give four options, and you have to choose only one of them. Let's see what comes to your mind when you look at the options. Do you want your users to have the feeling that the product is awesome? This is the product that you are building, or you own the product. Or that the company is awesome, the company that is building the product? Or that the brand is awesome, the brand that's owned by the company that is building the product? Sounds tricky, right? All of these answers look like they are correct, but she says none of these. There is one more choice, which is the right answer, and that is "I am awesome". So you have to make your users feel that they are awesome. But wait, people don't actually talk like that.
Nobody says, "I am awesome because of this product." They only go on to say, "I like this product" or "The app is amazing." That's a question that comes to all of us, and this is exactly the snippet from the book. She goes on to say that when a user says, "This product is amazing, you should see what it does," they actually mean, "I am amazing, you should see what I can do with it." So when you have such users, who have a feeling that they are awesome because they are using the product, they will do a lot of things. These are some of the things that they will do: they will go about talking for the product, evangelizing, which is free marketing. They will remain loyal. They will tolerate problems; whatever the problem in the product, they will be able to tolerate it. They'll resist competition: they wouldn't want to go to the competition, they would want to stick with the product that they are using. And they'll form community groups. There are situations where, even before the product owner answers the questions, there are diehard fans who would want to solve problems for other users. They'll show off their results, and a lot more. So these are all the outcomes when you have the user feel that they are awesome. Such users are the secret for sustained success. All of us in business want to be doing business in the long run, and the secret of sustained success lies with successful users. If you want successful users, user experience is important, and in this digital world, digital user experience is important. User experience could be in any format. It could be in the way your support technician picks up the calls and answers a query. Or it could be the way you are designing your UI, keeping it very simple, having all the configurations easy to do and easily understandable. Or you automate many things, such that the user may not have to do out-of-the-box stuff.
It could be in any format. But experience plays a major role as a differentiating factor if you want to stay ahead of your competition. There is a Forbes article that lists 100 stats about the digital world and user experience, and I've taken just a few from that. 89% of companies have adopted a digital-first strategy; they have started moving to the digital world. And 86% of companies believe that cloud technology is critical for digital transformation. If you want to take your product to the digital world, you can't say, "I want to take it digital, but I will still be using my legacy code, the legacy software and technology that I have been using." No, you have to adopt the relevant cloud technologies if you want to adopt digital. And 67% of consumers will pay more for a great experience. So experience is important; they don't mind paying that extra for it. 87% of companies think digital will disrupt their industry. People have started feeling this, and I should say that in the last two and a half years the pandemic has only accelerated digital adoption. 83% of enterprise workloads are in the cloud. It's not that only the startups or SMBs are in the cloud; even enterprises have realized the importance, and they are moving to the cloud. All these facts and references that I quoted go on to say that digital experience is important and that we are living in this era of an experience-driven economy. The experience that you give your customers is definitely going to have an impact on your business and the economy of your business. Let's now look into observability. What is observability, and how is it different from monitoring? Are monitoring and observability the same, or are there any differences? I would want to put it in a simple way.
When you talk about observability versus monitoring, the base of any system, or of any monitoring that you want to do, is observability. That is the underlying platform. Whatever you want to observe, be it your server or your application, the system has to be observable. There must be some ways, some APIs or some protocols, using which you can fetch the relevant data that you want to monitor. So the underlying layer is observability, on top of which monitoring lies. Only if your systems are capable of being observed can you do monitoring on top of them and collect all the relevant metrics. Depending on what you are monitoring, the metrics will vary. So monitoring is the second layer. And there's no point in just collecting all these metrics and keeping them to yourself. You have to do analysis on top of the data, the humongous amount of data that any system is collecting. There has to be analysis, there has to be segregation, and you have to make meaning out of the data that you collect. There's no point in just collecting data if you're not going to make sense of it and give that benefit to customers. So this is the stack as I would call it: observability is the underlying platform, then you can do monitoring, and then you have to do analysis on top of it. Don't just keep your data idle. If you're not going to do anything with the data that you're collecting, do not collect it. Why waste your resources? So keep in mind that all the data that is being collected has to be analyzed, maybe for various reasons. Only if you have such data and such analysis can you apply the latest AI capabilities. I'll be talking about AI towards the end of the session, too. But analysis is important. Let's move on to how I see DevOps, the definitions of DevOps and SRE, and what they have to do with observability.
So if you take DevOps, there are different definitions; a continuous feedback mechanism is what we all know of. But I see it through a different analogy. There used to be, no, I shouldn't say there used to be, there still is this role called developer, whose actual role is to write code, make sure that there are no errors in the code, attend to all the code-level errors, build it, and bundle it as a product. Possibly ten years ago, developers thought that with that their role was over. I've been in the industry for 22 years, and I've been a developer myself, so we used to do this. We thought that was the role of developers, and with that we were done. We passed the build on to another person, the operator, who actually takes care of deploying the build, attending to customer complaints, making sure the application is up and running, keeping up with the SLAs, and fixing any problems that happen in the system. All this used to be the role of the operator. So these two roles, these two persons, were actually different, and there were a lot of blame games: "This is not my problem." "This is not my problem." This used to happen. But with digital adoption, with the latest technologies and the transformation that is happening, we are moving to an era of deploying multiple builds within the same day. We at Site24x7 develop and deploy three to four builds a day, from what used to be a build in three months. That's how systems and companies are evolving. In such a situation, these two roles cannot be separate, so they merged, and that's when DevOps was born. This is how I see DevOps, as a person or a character or however you want to define it: they have to have some amount of knowledge of everything from what is happening in development to what is happening in deployment. Only then will they be able to quickly fix any problem and take the fix to the production environment.
And as seen in this diagram, he or she is not a superhero who can do all this alone; they need to make use of tools. There are a lot of tools available, be it in-house tools, third-party tools, or open source tools, which help DevOps people in their day-to-day activities and make sure that they are able to solve problems easily. And DevOps itself is moving on. As things move to the cloud and digital is being adopted, DevOps is transforming into DevSecOps, where security at all these layers is important. Be it at the application layer or at the infrastructure layer, everything has to be safe and secure, and that has to be taken care of. That is added as an additional role for DevOps. Now, this is about DevOps. What is SRE? Are the two different, or are they the same? There are a lot of definitions for these two. But the simple way of putting it is: if DevOps is about the principles of what is to be done, SRE is about how you do things. If you take the differences between DevOps and SRE, DevOps is about connecting the development and ops teams with a set of principles, and the primary focus there is on delivery. SREs, on the other hand, are more ops-oriented and lean more towards the production environment, where they respond to incidents, monitor all the events, make sure that they reduce faults, and take care of automation. They have to do with the reliability of the systems, particularly when things are in the cloud and you have to take care of all your deployments. SRE plays a major role there. And if I have to put this in development terminology, in Java terminology, I would say SRE implements DevOps. That's how I would want to put it.
With that SRE definition, if you take a cloud architecture, there are various layers in it, from the end-user layer to the application, platform, and infrastructure layers. It is important for the SRE to have end-to-end visibility across all these layers, because the problem could be anywhere. In the cloud, if your application is going down, it could be because of an ISP problem at the customer end; or there is a problem in the way the application is written, say an infinite loop in the code; or a database connection is not being closed; or there is a leak of resources, such as a file, at the platform layer; or it could even be a problem in a port of a switch at the infrastructure layer. Any problem anywhere in the stack is going to impact your application, and it's going to impact your business. So end-to-end visibility of all these layers is important, and the SRE has to have it, for which tools will be helpful. That's what observability is about: the SRE has to have an overall view of what is happening in all these layers. Now, when we talk about the pillars of observability, all of us know the three pillars. In fact, there is a fourth pillar that is getting added, but the main three pillars of observability that are being discussed are these, and we'll see all of them in detail. Metrics is the first pillar; then traces, where you get end-to-end visibility or get to the line of code that is having an issue; and logs. These three are considered the three pillars of observability. So let's get into the details of what we mean by metrics, traces, and logs. That's what we will cover in this section, achieving observability.
When we talk about metrics, as I said during the observability section itself, the metrics will vary depending on the component that you are monitoring. If you are monitoring your server, the metrics will be: what is the CPU utilization, what is the memory utilization, what are all the processes that are running in the system. If you're monitoring your application, the metrics will be: what is the response time, how many times a particular transaction is being called, how many times people are hitting the system, and what the database calls are. And if you're going to monitor your database: the number of connections, whether all the connections are closed, and what the slow queries are. So depending on the component that is being monitored, the metrics will vary. And irrespective of the component that we are monitoring, we have to make sure the basic things are being collected. Whatever the component, the primary thing is the availability metric. Uptime is an important metric to collect: be it your application, your database component, or your infrastructure, server, and network, are all the underlying components up and running? The industry standards expect 99.99% availability. It's almost 100%, with a small difference here and there. The industry standards have moved from three nines to five nines. That's the expectation, and that's what the competition is giving. The system has to be up and running almost all the time, which means you have to monitor all the resources, the entire stack. Availability metrics have to be monitored. And in the cloud, security metrics have to be monitored. Security metrics, again, will vary.
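As a side note on the availability figures quoted above, a short sketch can show what "three nines" through "five nines" mean as a yearly downtime budget. The helper name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration; they do not come from the talk or from any monitoring product.

```python
# Hypothetical helper (not from the talk): convert an availability target
# into the amount of downtime it permits over a period.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year, ignoring leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Three, four, and five nines respectively.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, three nines leaves roughly 8.8 hours of downtime a year, four nines (the 99.99% figure) roughly 53 minutes, and five nines a little over 5 minutes, which is why the whole stack has to be monitored to hit such targets.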
If you are talking about an application, the questions will be: how are you defining your security XML? If there are going to be hundreds of requests coming in at the same time, have I defined all the thresholds properly? Those are some things that you have to monitor. And at the network level, am I able to monitor all the relevant details, and have I made sure that one user's data is not accessible by another user? In the cloud, people are trusting the vendors and putting their data into your system. So you have to make sure that user segregation is taken care of properly when you are designing your application and your database. So there are various metrics that you have to look into with respect to security, too. Then performance metrics. You have to take care of all your availability metrics, but that alone is not sufficient. Your resources may be up and running, but if they are performing very slowly, it's going to have an impact on your business. Industry standards expect any application to respond within just 2 seconds. And it plays an important role in your SEO, in your marketing, in making sure that you come up in search engine results. So performance plays a key role: how quickly are you able to respond? And performance, again, has to cover all the layers: your application's performance, your database performance, your server performance, your network performance, and your end users' performance, where you have to take care of the ISP, the browser, the device type, and the version. All of those play an important role. So performance metrics have to be monitored, and so do scalability metrics. There are different areas, different domains.
Whether you are a startup, an SMB, or an enterprise, you have to look at how quickly you can define and do auto scaling. When we talk about scalability, both vertical scaling and horizontal scaling have to be taken care of. There could be situations where your system can only handle a certain amount of load, and you want to add more instances and take care of horizontal scaling. Or there could be situations where CPU-intensive calculations are happening, and you want to increase the size and take care of vertical scaling. All these metrics have to be taken care of. Then cost metrics. When you are deploying in the cloud, are you really utilizing the resources that you have purchased? A lot of the time, what happens is that when we start using a cloud environment, we put in our credit card and start using it. After that, there is another department, the finance department, that takes care of paying the bills every month. Are we fully utilizing the resources? That has to be monitored. There are surveys that say 30% of resources are not being utilized. So you need to measure all those metrics and make sure that you make full use of whatever you are paying for. So the cost angle has to be monitored. These are the different types of metrics that you have to take care of when you are monitoring your entire infrastructure. Moving on to traces: what do you have to do with traces? Traces have to do with having end-to-end visibility, or pinpointing the line of code that is having issues, particularly as the industry moves from a monolithic architecture to a microservice architecture, where each component runs in its own container and the container can be spawned, deployed, and destroyed on its own. So it's very challenging. Even though the architecture looks very simple, monitoring this environment is very challenging.
Each of them can have its own programming language, too. Then how do you make it easy for the SRE to find out where the problem is? In such situations you need tracing across all these tiers, be it your client, your web, your server, or your data tier. Within your server layer you can have a set of machines for data securing, a set of machines for data collection, and a set of machines for data processing, and you need to know exactly where the time is being spent. There are situations we have faced where a particular transaction makes millions of method calls, and you will hit such issues only in a real deployment. It's very tough, very challenging, to find the problem in a live deployed environment. So the tools will help. And tracing across all these layers is also important: distributed tracing in a microservice architecture, where one problem will have a cascading effect on the entire architecture of your application. So it's important for you to trace them. Then, moving on to the third pillar, which is logs. Why have logs become a pillar? All of us, whenever we have a problem, go and look at the logs to find out what the issues are. And when we have our deployment in a distributed architecture, taking remote control of each of those machines and looking into where the problem is, is going to be challenging. Instead, if we are able to collect all these logs from the distributed architecture, do some processing on top of them, and store them in such a way that you can easily search them and look at what the problem is, that's going to be helpful for SREs. So consolidating logs from across all the servers, doing all the real-time prepping, and storing them in an easily queryable format is what log management is.
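The idea of turning raw log lines into something queryable can be sketched in a few lines of Python. This is a minimal illustration only: the log layout ("date time LEVEL service: message"), the sample lines, and the names `LOG_PATTERN`, `parse_line`, and `errors_by_service` are all assumptions made for the example, not the format or API of any particular log management tool.

```python
import re
from typing import Iterable, Optional

# Assumed (hypothetical) line layout: "<date> <time> <LEVEL> <service>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>[\w.-]+): "
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one unstructured log line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def errors_by_service(lines: Iterable[str]) -> dict:
    """Example query over the structured records: count ERROR lines per service."""
    counts: dict = {}
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] == "ERROR":
            counts[record["service"]] = counts.get(record["service"], 0) + 1
    return counts


# Made-up sample lines standing in for logs consolidated from several servers.
SAMPLE_LINES = [
    "2022-05-01 10:15:00 ERROR checkout: database connection not closed",
    "2022-05-01 10:15:02 INFO web: request served in 120 ms",
    "2022-05-01 10:15:05 ERROR checkout: slow query detected",
    "not a structured line",
]

if __name__ == "__main__":
    print(errors_by_service(SAMPLE_LINES))
```

Once each line is a record with named fields, questions such as "which service is producing the errors?" become simple queries instead of manual searches on individual machines.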
In one simple term, log management is converting your unstructured data into structured data. Out of the box, there is support for all the common applications. Logs are also helpful for your audit trails. Be it your database logs, application logs, or network logs, there are different types of logs, and all of them are required in a cloud environment. They are required for your auditing and compliance, too. These logs will help you find out what changes have gone in, who made the change, where the change was made, and when. These are some important questions to address in order to build trust, because sometimes the customer will come and say, "I don't know when this change was made or who made it. Can you help me find out?" For that, log management will be helpful. So, to put it in a nutshell, metrics, traces, and logs form the three pillars of observability, and the AIOps that people are using definitely helps SREs. Let's spend some two or three minutes looking into how AIOps helps SREs. There are different ways in which AI can help. It can help with evaluating past performance, because you have all the data in your system. Based on the historical data that is available, the AI system will be able to predict and tell you what to expect. Suppose you had a New Year sale last time, a huge number of people landed on your system, and you needed some ten extra servers. Based on that data, when you are planning another sale, the system will be able to tell you what you have to be prepared for. It will also help in enabling communication.
Take chatbot integrations. All of us have our own communication channels; possibly you're using Slack or Microsoft Teams, and you don't want to go to another machine or another window to look at what the problem is. Whatever the monitoring solution, you can integrate it directly into your chat channel using chatbots. That will help, and it is again possible using AIOps. Next, managing a flurry of alerts. Any monitoring tool comes with a lot of alerts: segregating the alerts based on severity, assigning each to a particular technician. AI can help in doing all this automatically, so it puts you one step ahead in resolving the issues. Then, data correlation across tools. As any system uses multiple tools, the data from all these tools can be correlated with the help of AIOps. Again, correlating test results: we do a lot of testing, development testing, changing testing, deployment testing, pre-production and post-production testing. All these test results have to be correlated, and AI can help there. When we talk about AI, one of the important things we have to keep in mind is that exhaustive training is important. The AI system is only as powerful as how much it has been trained. Accurate data points fed into the system will help it make correct predictions. So that is something we have to keep in mind when we talk about AIOps. Based on all this data, it can do exhaustive self-training so that it prevents false alerts. So these are some ways in which AI can help. It can help in decision making, in coming up with forecasts, and in dynamic threshold settings, where the user need not set any threshold. Make it easy for the customer; let the system decide. Based on the historical data, AI will adjust the thresholds. The threshold can be adjusted for each transaction, and for each server.
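To make the idea of a threshold derived from historical data concrete, here is a deliberately simple sketch. It uses a plain statistical baseline (mean plus k standard deviations), which is a stand-in chosen for illustration and not the AI technique of any specific product; the function names, the value of `k`, and the sample response times are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def dynamic_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Derive an alert threshold from past samples: mean plus k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return mean(history) + k * stdev(history)


def is_anomalous(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Flag a new sample that exceeds the threshold learned from history."""
    return value > dynamic_threshold(history, k)


if __name__ == "__main__":
    # Made-up response times (ms) for one transaction; each transaction or
    # server would keep its own history and therefore get its own threshold.
    response_times_ms = [100, 110, 90, 105, 95]
    print(f"threshold: {dynamic_threshold(response_times_ms):.1f} ms")
    print("200 ms anomalous?", is_anomalous(200, response_times_ms))
    print("110 ms anomalous?", is_anomalous(110, response_times_ms))
```

Because the threshold is recomputed from each component's own history, nobody has to hand-pick a number per transaction or per server, which is the convenience described above. A production system would typically also account for seasonality and trend.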
You can take a lot of corrective actions using automation and scripts, which can be run automatically based on the predefined threshold settings that are available. These will help in resolving the problems, and they will help the SRE keep the system up and running all the time. Conversational chatbots, too. The future is in AIOps: finding accurate anomalies, reducing the noise, more forecasting. We are moving towards being proactive rather than reactive. All along, the industry has been in a situation where the question was: when a problem occurs, how do I go and fix it? What are the tools that can help me fix the problem? From that situation, it's moving towards: let's be proactive, let's make sure that the problem does not happen in the first place. To achieve that, AI will be helpful. So in a nutshell, metrics, traces, and logs combined with AIOps will help SREs in their day-to-day activities, make them smooth and easy, and pass that benefit on to customers. This is possible with the help of monitoring tools that collect all your metrics, your traces, and your logs in one console and apply AI on top. There are many such tools available in the market. One such tool is Site24x7, an AI-powered full-stack monitoring platform that lets you take care of all your monitoring needs, the whole stack that I talked about, from one single console. We cover everything from website monitoring to server, cloud, network, application performance, real user monitoring, application log management, CloudSpend, and StatusIQ. On top of that, we have alerting and reporting, and we apply AI on top of this. Site24x7 is hosted in Zoho's data centers. Zoho has been in business for 25 years, and Site24x7 has been a mature product in the market for close to 16 years now.
Zoho has ten data centers across five regions. In each region, we have a primary and a secondary data center. Customers can choose the data center so that their data resides within the geographical boundary of that particular region. We have one in the US, one in Europe, one in India, one in China, and one in Australia. We are coming up with more data centers depending on customers' requirements, too. Being a cloud provider, we take privacy, security, and compliance very seriously and obtain all the relevant certifications required for us to be a cloud provider. So, the key takeaways from this session, the last almost 30 minutes that I've been talking: experience is important, and in the digital world, digital experience is important. Observability has three main pillars, and we saw how you can achieve them. AI can help SREs in achieving observability. And Site24x7 is an AI-powered full-stack monitoring platform. I would like to close with this quote, one of my favorites: "We shape our tools, and they in turn shape us," a famous quote by Marshall McLuhan. What this means is that the tools you have in your hand have a great impact on the day-to-day activities that you do. When you have a hammer in your hand, everything looks like a nail. There may be a screw over there, but you only have a hammer in your hand, and you will end up hitting it. So it's important for you to choose the right set of tools, depending on what your business needs and your customer requirements are, so that you can take your business to the next level and be successful in whatever role you are playing. Thank you for your time. Have a nice time at the event. If you have any questions, feel free to write to me or to the support email ID that's provided here.
I'll be happy to arrange a one-on-one session if you need a demo of the product. Thank you.

Rajalakshmi Srinivasan

Director - Product Management @ Site24x7, Zoho Corp.



