Conf42 Site Reliability Engineering 2021 - Online

Peek into Observability from testers lens

Abstract

The term ‘Observability’ is heard more and more often, yet it is still quite new. But what does it mean? Is it just another name for monitoring?

In the modern technology world we work with many different types of systems: microservices, distributed systems and many others that resemble huge spider webs. Imagine testing these distributed systems with no clue of what’s going on under the hood. Gone are the days when testers could rely only on testing the user interface or the APIs. I worked on a distributed system where no one could say what was going wrong whenever there was a production issue. We had some monitoring and logging in place, but we had no clue where to look when things went wrong in production. We needed more powerful insight into the internals of the system and, more than that, visibility into what was happening under the hood, the kind of visibility that gives a team the superpowers to predict the future.

This is where we took our first steps into ‘Observability’. In this talk, I’ll share my journey of adopting a culture of observability within the engineering team.

Key takeaways

  • What is observability
  • Why is it important
  • How does it help the team
  • How observability can support testing

Summary

  • Parveen Khan is a senior QA consultant at Thoughtworks, based in London, UK. The talk is a peek into observability from a tester's lens.
  • We are good drivers because we make decisions based on risk. Our cars can drive much faster, but we hold them back because we know it's bad to drive when there's no visibility. But what about planes? They fly among the clouds all the time.
  • Observability is a measure of how well the internal states of a system can be inferred from its external outputs. It means you can answer any question about what's happening on the inside of the system. We get the data we need for an observable system by adding instrumentation.
  • In a distributed system, we don't want to go into each service separately to look at its logs. Instead, logs should be centralized in one place so they are easier to access.
  • Testers are curious explorers and great at asking questions. A trace tells a story that gives more low-level detail. Having access to these kinds of tools allows testers to look under the hood. It's not just about finding information; it can also help us deepen our understanding of the product.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Parveen Khan. I'm a senior QA consultant at Thoughtworks, based in London, UK. Today I'm going to share my experience and give you a peek into observability from a tester's lens. Quite a lot of you may already know what observability is, and a few of you may be new to it, which is why you are here at this conference. The aim of this talk is to introduce you to the topic and to show how it can be helpful for testers. Before jumping into the topic, I want to quickly take you through a simple scenario, and credit for this scenario goes to Pierre Vincent. I really liked how he used this example to explain the concept, so I'm using the same example. Imagine what you would do if you came across this foggy road while driving in bad weather. The first thing that comes to mind is that we need to slow down, right? But why do we need to slow down? It's because we don't have visibility of what's ahead of us, and we consciously make that decision based on the risk. So if we can't drive fast enough on a foggy road, does that mean we are bad drivers? Not at all. In fact, we are good drivers because we are making decisions based on risk. Does that mean the car isn't good enough to drive faster? Not at all. Our cars can drive much faster, but we hold them back because we know it's bad to drive when there's no visibility. So we are ultimately stuck, right? But what about planes? They fly among the clouds all the time, right?
It's because pilots have additional instruments, so they don't have to rely solely on their eyes. So what does this have to do with software development? We all know the current trend is all about going faster, and we are all adopting different practices like agile and DevOps, working with distributed systems, microservices and whatnot. We want to deliver value quickly, right? By doing this, we are building faster cars. We are moving to microservices architectures or distributed systems, and the reason is simplicity of development, right? But on the other hand, there is a lot of complexity and there are multiple moving parts at the same time, which makes things even more complex. When we distribute a system, we also distribute the places where things might go wrong. So we know that we need visibility, but how can we get that visibility into our system? The answer is by having observability. Before trying to understand what observability is, let's first understand why we need it in the first place. I always look for real-life examples to understand a concept, and I don't get convinced by just reading theory. So I'd love to share my real-world experience of how I came to understand why we needed observability on the system we were working on. I joined a new team, and the entire team was new. I had an opportunity to work on a really exciting and interesting product. It was a completely new domain for me: an automated invoice system built on a microservices architecture. Our work as a team was to build new features and also fix bugs. As testers, when we start working on a new product, the first thing we do is try to understand the product.
But the more I tried to learn about the product, the more I felt that it was too complex. Another reason I felt this way was a pattern I saw: a lot of tickets being marked as blocked. I used to see a lot of production issues each day, and developers would pick them up and investigate to find the root cause. They would spend days or even weeks and then mark them as blocked, because they couldn't find any information: they couldn't find why the issue was happening, or even where it was. That made me think: what's wrong here? At this point I stopped thinking that the complexity of the product was the issue. Interestingly, at the same time I was reading a lot about observability without even knowing when and where it could be used. I was not just reading but also trying out different tools that promise to deliver observability, to see what they bring. Around then I had a conversation with one of my team members, who was a developer. That conversation gave me some food for thought: this is what we are missing on our product, and it is observability. The conversation was a light bulb moment for me because it answered quite a few questions, though it opened up a lot of new questions too. The answer I got was that we had very little or no visibility into our system, which is why a lot of issues were marked as blocked: the developers could not debug, and because they could not debug, they could not find the root cause. I keep talking about observability, so now let's look into what it is. There are quite a lot of definitions to be found if we google it, but this is a simple one that I thought I could share.
Observability is a measure of how well the internal states of a system can be inferred from its external outputs. It means you can answer any question about what's happening on the inside of the system just by observing the outside of the system, without having to ship new code to answer new questions. When systems are down, you need to find answers by asking questions as quickly as possible, right? So the system needs to be observable, so that it can explain what's happening and we can find out what's going on inside just by observing from the outside. But how can we make the system observable? The answer is by using data. Now, how can we get the data? And what type of data do we need to have an observable system? We can get the data by adding instrumentation, and that instrumentation can give us data in the form of logs, traces or metrics. Let's talk about each of these before moving ahead with the story. What are logs? A log is a simple message that carries some information. It might have a timestamp and a payload, which help give more context. Again, if we're talking about distributed systems, we don't want to go into each service separately to look at its logs. Instead, we should have them centralized in one place so it's easier, right? Let me tell you, while I was working with this team, we did have logs. It's not that we didn't have any, but they were all stored separately for each service. We used NLog, and to access the logs we had to go to each service separately. The only way to view them was to open them in Notepad++, so whenever there was an issue, we would end up with multiple Notepad++ tabs open.
It was such a pain. To add to that, the only way we could search the logs was by using Ctrl+F. Can you imagine? This is the reason why logs should be centralized, so that we can access all of them in one central place, and why they should be easily searchable. The way to make them easily searchable is to have structured logs. Now, coming to metrics: a metric could be a simple trending number, or a simple value that expresses some data about the system. Metrics can represent different things. A metric might have a name, a time and a value. Metrics are usually represented as counts or measures, often calculated over a period of time. For example, a system metric can tell you how much memory is being used by a process out of the total. An application metric can show you the number of requests per second being handled by a service, or the error rate of an API. A business metric could be something like how long it takes a user to log in, or how long it takes a user to complete a certain action in our product. Metrics are really good at aggregating things, but not so good at pinpointing specific detail, like: at this particular time, this is the customer who was having a problem. So how can we do that? By using traces. A trace tells you a story that gives more low-level detail. It shows the entire flow of a request, and I think it's really valuable while debugging. A single trace shows the activity for an individual transaction, request or event as it flows through an application, so it shows the request end to end. Traces are a critical part of observability, as they provide a lot of context. Okay, so I've been saying that with observability we can ask questions, but what kind of questions can we ask?
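As a rough illustration of the three kinds of data described above (this is not something shown in the talk), here is a minimal Python sketch. It assumes a plain in-memory list standing in for a central log store, and the service names, trace IDs and field names are all hypothetical. Each log entry is structured (JSON with named fields), so it can be searched by field rather than with Ctrl+F; a metric is then an aggregate over those entries, and a trace is the ordered story of one request.

```python
import json
import time
from collections import defaultdict

# Stand-in for a central log store. A real setup would ship entries to a
# log aggregation service; a Python list is used here only for illustration.
central_log = []


def log_event(service, trace_id, level, message):
    """Append one structured log entry (a JSON string) to the central store."""
    entry = {
        "timestamp": time.time(),
        "service": service,      # which service wrote the entry
        "trace_id": trace_id,    # shared by every entry of one request
        "level": level,
        "message": message,
    }
    central_log.append(json.dumps(entry))


def search_logs(**criteria):
    """Return every entry whose fields match all of the given criteria."""
    entries = [json.loads(line) for line in central_log]
    return [e for e in entries if all(e.get(k) == v for k, v in criteria.items())]


def error_rate_by_service():
    """Metric: fraction of each service's entries logged at level 'error'."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for entry in search_logs():
        totals[entry["service"]] += 1
        if entry["level"] == "error":
            errors[entry["service"]] += 1
    return {service: errors[service] / totals[service] for service in totals}


def trace(trace_id):
    """Trace: one request's entries across services, in the order logged."""
    return [(e["service"], e["message"]) for e in search_logs(trace_id=trace_id)]


# Two hypothetical requests flowing through two hypothetical services.
log_event("invoice-api", "req-1", "info", "invoice requested")
log_event("pdf-renderer", "req-1", "info", "pdf rendered")
log_event("invoice-api", "req-2", "info", "invoice requested")
log_event("pdf-renderer", "req-2", "error", "template missing")

print(error_rate_by_service())  # aggregate view: which service is failing
print(trace("req-2"))           # specific view: what happened to one request
```

The metric only says that one service is failing some of the time; the trace for a single request shows which step failed and why, which is the difference between aggregating and pinpointing described above.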
I can give you some examples of questions that can be asked. It could be something like: why is X broken? What services does my service depend on, and what services depend on my service? What went wrong during this release? Why has performance degraded over the past quarter? What logs should we look at right now? Or: what did my service look like at point X? Just as with DevOps, we cannot say we are doing DevOps just by having some automated tool or some process in place. It's more than tools; it is a cultural and mindset change. Similarly, we cannot say we have observability just by having some tools in place and logging some information. It is not just about getting the tools, sending some data and trying to observe the system. It is a cultural change. Now, you might be thinking: what's in it for testers with all this observability and all these new tools? Why and how is it useful for testers, and how can they use observability? For testers, it becomes easier to find more information around issues, right? For example, while testing, we might see some unexpected behavior or some kind of failure. Having these kinds of tools in place, and having access to them, allows testers to look under the hood and find out what is happening with the request. Not just that, it also allows testers to learn more about how the system communicates and works. For example, I would use DevTools to see what was going on when something didn't look right while I was testing from the UI point of view. But I wouldn't get enough information just by looking at DevTools.
Having these tools in place helped me get more information that could be added to tickets when raising bugs, which is helpful for the developers. It's not just about finding information while looking into issues; it can also help us deepen our understanding of the product, which is really important for a tester. Testers are very curious explorers and great at asking questions, so this could be a tool for exploring and asking questions. As a tester, I tend to ask a lot of questions when I don't understand things. Testers are great at exploratory testing; they are not just good at asking questions but always curious to find information about the system. While exploring the logs, the metrics, the traces or any other kind of data, testers might point out where there is a need for more instrumentation. Not just that, it also supports testers in testing in production. So it allows teams not just to shift left but also to shift right. I really like this tweet by Maaret and how it has been put together, saying that a lot of the time good debugging and good exploratory testing are indistinguishable. When developers explore, they more often call it debugging, whether they know there is a problem or just suspect there could be. When testers explore, debugging is close to the last word being used. To summarize, by making systems observable, anyone on the team can easily navigate from effect to cause in the production system. It makes it easier to debug. The goal of observability is not just to collect logs, metrics or traces, but to use the data to get feedback. It doesn't just allow us to find the knowns of the system; it also allows us to discover the unknown unknowns. Every learning experience and every journey has something to take away, and I had some learnings to take away from this experience as well.
The key takeaway for me was that we as testers can go out of our way and think outside the box. We care about and advocate for quality, and that can mean bringing improvements to the process or bringing in new tools related to test automation. But that's not the limit. I learned that we do not have to limit ourselves and say: this is not related to testing, so let's not look into it or learn about it. I saw the problems my team was going through: developers getting frustrated when they couldn't resolve production issues, and product owners getting frustrated because they had to answer to clients and did not have enough information about those production issues. I didn't know the answer or the solution, but being active in the community, seeing new tools and concepts, exploring them, trying them out myself using open source tools, and then presenting the result to my team as a suggestion by building a proof of concept helped. What helped was not limiting myself to testing tools and trying to think outside the box to help my team. When I left the team, we had not yet completed our observability implementation, but we had taken our first steps. We had gone from having no visibility to having structured, centralized logs that could be easily queried, and we were taking the next steps. To end, I would like to say that observability gives the entire team the power to get visibility when needed. Observability is much more powerful when you apply it with the right mindset and with clear processes in place. It allows teams to become proactive towards issues rather than reactive. It gives superpowers to everyone on the team, whether developers, ops engineers, SREs or testers.
So thank you so much for joining my session and listening to my story. Happy to answer any questions if you have any, and be sure to check out my blog post pervincans.com and do follow me on Twitter at pervine. Thank you so much.

Parveen Khan

Senior QA Consultant @ ThoughtWorks
