Conf42 Observability 2025 - Online

- premiere 5PM GMT

Observability Passport: Navigating the What, Where, and When of Your Frontend App

Abstract

We created an Observability Passport for every project — a single source of truth with links to logs, Sentry, telemetry, and security tools. I’ll show how it simplifies diagnostics, speeds up incident response, and improves app stability and team transparency.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Okay. Hello everyone. My name is Vadim. I work at ETG, where I currently lead the frontend team. I mainly work on improving our frontend apps, UX, and other technical areas in the transport product department. So what are we going to talk about today? Not long ago, we had a typical situation with one of our products: something broke. Following best practices, we turned the service off and on again, but it didn't help. It also didn't help that there were no errors in Sentry, and we did not know where to look for the problem. Only after some time did we find out that the problem wasn't with us at all. One of the teams that depended on our service through our common integration layer had released an update. Their CPU usage suddenly increased, and this affected our system without us noticing. This story made me think: how can I really know that the product works, and works properly? Recently, our team launched a new product, car rental. The service is available for B2B partners and corporate clients. When we launched it, I decided to approach this problem more thoughtfully. So, how can we know that the product works, and works properly? The simplest way is to type the application address in the browser and see that everything works: pages open, requests are sent, and so on. But you'll agree that checking the application this way every time is impossible, especially when support comes to us and says everything is broken. In that case, there is no point in checking anything: we're already down and losing money. We want something that, instead of support or our clients, will tell us in time that something is going wrong and a developer should take a look before the application crashes. As you might guess, I'm talking about the three pillars of observability: metrics, traces, and logs, based on which we can understand what is happening with our application. So today we will learn what observability is, what monitoring is, and what the key differences between them are.
We'll also cover what metrics are and what types there are beyond purely technical ones, how SLA differs from SLO and SLI, and what an observability passport is and how it helps. So let's start with the basics: what is a metric, and what types are there? Let's start with a simple definition. A metric is a numerical indicator that measures a certain aspect of system operation, for example page load time, number of errors, or the percentage of successful requests to the server. In the end, metrics help us measure, understand, and improve our applications. What types of metrics are there? First, technical ones. Technical metrics describe the internal processes of the system; they're most often needed by developers to understand what's happening under the hood of the application, for example CPU usage, RAM usage, response time, and so on. Then there are business metrics, which show how efficiently the application helps achieve business goals, for example conversion to purchase, user retention, or revenue per user. But there are also metrics that are difficult to attribute to one category, because they affect both the technical and the business side. These are called hybrid metrics, for example performance and page load speed: the well-known Web Vitals by Google. On one hand, these metrics describe purely technical characteristics of your application: page load speed, reaction time, and so on. But in the end, all of this affects the user experience and your position in search results, which can affect your business wallet. Or take the number of errors in the system: frequent errors annoy users, increase the load on support, and can affect business metrics such as user retention. Another good example of hybrid metrics, which sit between developers and business, are the availability metrics: SLA, SLO, and SLI. This set of metrics focuses on measuring the fulfillment of commitments to deliver high-quality, reliable, and stable services.
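As a concrete illustration of a hybrid metric, here is a minimal sketch that rates a Largest Contentful Paint (LCP) sample against Google's published Web Vitals thresholds (good ≤ 2500 ms, poor > 4000 ms); the function name is illustrative, and in a real app you would feed it values from the `web-vitals` library.

```typescript
// Rate an LCP sample using Google's Web Vitals thresholds:
// good <= 2500 ms, needs-improvement <= 4000 ms, poor above that.
type Rating = "good" | "needs-improvement" | "poor";

function rateLCP(ms: number): Rating {
  if (ms <= 2500) return "good";
  if (ms <= 4000) return "needs-improvement";
  return "poor";
}

// A technical number (milliseconds) becomes a user-experience judgment —
// exactly why such metrics sit between developers and business.
```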
What does this mean and why is it needed? SLA, service level agreement, is an agreement about the provided level of service made between business and development. For example, the agreement may commit that the service will be available 99.9% of the time during a month. SLA in this case is used as a formalization of business expectations. SLO, service level objective, is an internal goal related to system availability. The SLO defines more detailed indicators that developers focus on in their daily work, for example: API response time should not exceed 200 milliseconds for 95% of requests. SLOs help technical teams focus on what is important for their users and set realistic benchmarks for monitoring. SLI, service level indicator, is the actual metric that measures the fulfillment of the SLO. It is a numerical indicator based on monitoring data. For example, if the SLO says that 95% of requests should be processed within 200 milliseconds, then the SLI is the percentage of requests that actually fit within this time. How do these metrics help solve problems? Availability metrics help developers evaluate the current service level: you know exactly how your performance goals match reality. They help prioritize tasks: for example, if the availability level approaches the SLA boundary, this is a signal to immediately address the problem. And they help communicate with business: availability metrics allow you to talk to business in the language of goals and priorities, explaining how technical problems can affect users. A special place in working with SLOs is occupied by the error budget. This is a concept that helps establish a balance between stability and the speed of shipping new features. How does it work? The error budget is the allowed level of deviation that you can afford while staying within SLO limits. For example, if your SLO is 99.9% uptime, then your error budget is 0.1%. What does the error budget affect? First, it determines priorities.
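To make the SLO/SLI distinction concrete, here is a minimal sketch that computes the SLI (the observed share of requests meeting the latency target) from a list of request durations; the 200 ms / 95% figures mirror the example above, and the function name is illustrative.

```typescript
// SLI = fraction of requests whose duration meets the latency threshold.
function latencySLI(durationsMs: number[], thresholdMs: number): number {
  if (durationsMs.length === 0) return 1; // no traffic: trivially within SLO
  const ok = durationsMs.filter((d) => d <= thresholdMs).length;
  return ok / durationsMs.length;
}

// SLO check: "95% of requests under 200 ms".
const sli = latencySLI([120, 180, 250, 90], 200); // 3 of 4 requests fit
const meetsSLO = sli >= 0.95;
```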
If the error budget is almost exhausted, the implementation of new features is frozen, and the team focuses on solving problems. Therefore the error budget can be used as part of sprint planning and helps avoid chaotic switching between tasks. Second, it simplifies decision making: for example, you can explain to business that implementing a risky feature now is a direct path to violating your SLA. And third, it helps find balance: if the error budget is not exhausted, the team can safely experiment with new features without fear of breaking production. You have probably already noticed that collecting metrics is not enough. This is the main difference between regular monitoring and observability: metrics should be analyzed dynamically, anticipating trends and acting proactively to prevent negative outcomes. When you have just set up monitoring, you can answer the question: is the system working as expected? You looked at the metrics: yes, requests are going through, no errors; and if the load increases, monitoring will show you an alert. But if your system has the property of observability, then you can already answer the question: why does the system work the way it works? For example, after launch time the load increases, and during this time it's better not to release anything. And this will help you even more when you're analyzing complex incidents. In the end, monitoring helps you understand that something broke; observability helps you figure out why it's happening. So now, with everything we've learned, let's turn to our main question: how can I understand that the product works, and works properly? Now we know that this question consists of two parts: how to understand that it works, the part related to monitoring, and how to understand that it works properly, the part about observability in general. When I asked myself this question, I thought that I needed just a page in Confluence, or a folder of bookmarks,
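The error-budget arithmetic above can be sketched in a few lines; the function names and the "half the budget spent" scenario are illustrative, assuming a 99.9% availability SLO as in the talk.

```typescript
// Error budget = allowed failure fraction implied by the SLO.
function errorBudget(slo: number): number {
  return 1 - slo; // e.g. SLO 0.999 -> budget 0.001 (0.1%)
}

// Fraction of the budget consumed: 1.0 means fully exhausted,
// which is the signal to freeze features and focus on stability.
function budgetSpent(failed: number, total: number, slo: number): number {
  const allowedFailures = errorBudget(slo) * total;
  return allowedFailures === 0 ? Infinity : failed / allowedFailures;
}

// 10 failed requests out of 20,000 against a 99.9% SLO:
// 20 failures were allowed, so only half the budget is spent —
// still room to experiment with new features.
const spent = budgetSpent(10, 20_000, 0.999);
```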
from where I can see everything that I want to know about my application. Of course, without thinking twice, I called this list of links the observability passport. So what's inside my observability passport? The first thing I set up was, of course, an error tracker. In our case it's Sentry, but it can be absolutely anything, whatever you like, even a simple console.log. The main thing is that this solution should let you understand that something went wrong, so you know how to fix it. Next, after finishing the monitoring part, it's time to think about observability. Would it be enough to close the question of errors by catching and analyzing them? Analysis is good, but setting up your work in such a way that errors do not occur at all is another level. How to achieve this? Obviously, conduct the analysis before an error occurs, and that means we need to analyze the code itself. The range of tools for this is simply huge, from basic linters and test runners providing code coverage metrics to static analyzers detecting vulnerabilities in the app and its dependencies. And if tests and coverage are more about monitoring, then finding and analyzing vulnerabilities is more about observability. One such tool for analyzing found vulnerabilities, which we use, is DefectDojo. DefectDojo offers many metrics to get visual information about trends and insights into vulnerabilities in your organization. Similar to 99.9% availability targets, DefectDojo helps set SLAs specifically for resolving issues based on their severity. In our case, obviously, the SLO for the number of critical vulnerabilities in production is zero, and the time to fix already-found ones is seven days, which, by the way, DefectDojo suggests by default. Then, based on the criticality of the found problem and on your team's agreements, the problem is taken into work and should be fixed within a certain period, which corresponds to your SLA. All right.
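The "even a simple console.log" error tracker mentioned above can be sketched as a tiny reporter that records errors and forwards them to any sink — Sentry's `captureException` in production, or `console.error` as the simplest fallback. This is a hypothetical wrapper, not Sentry's actual API surface.

```typescript
// A minimal error tracker: remembers what it saw and forwards to a sink.
type Sink = (err: Error) => void;

function makeTracker(sink: Sink) {
  const seen: string[] = [];
  return {
    capture(err: Error): void {
      seen.push(err.message); // keep a local record for quick checks
      sink(err);              // forward to Sentry, console.error, etc.
    },
    count: (): number => seen.length,
  };
}

// Simplest possible sink — the "console.log tracker" from the talk.
const tracker = makeTracker((err) => console.error(err.message));
tracker.capture(new Error("payment widget failed to load"));
```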
Clearly, the tools vary, but the main idea of using SLAs to manage application stability, whether it's API availability, Sentry errors, or code quality, always comes down to the following set of actions: identify the areas to monitor (API errors, vulnerabilities, and so on); define error boundaries, that is, what deviates from normal behavior; set an error budget, the acceptable tolerance level; collect your errors, metrics, and so on; visualize the data in dashboards; set up alerts as needed; and assign owners and define response protocols for incidents. Of course, the observability passport will not be complete without collecting basic metrics such as memory consumption, CPU load, and others, in our case provided by the prom-client package. And it will also not be complete without informative logs from our application. Logs are records of events. They're easy to create, for example with a simple console.log. Logs give us detailed information about specific types of events that interest us, for example logging all requests coming to our application, or all requests that we make. But logging, as a rule, happens only within the one component of the system for which we are responsible. We write the frontend application, therefore we can add logging only there; the backend might be the responsibility of a completely different team, and they may write logs for their own purposes. But as you understand, there are cases when errors occur and one component of the system is not enough: the reason may lie much deeper, and to understand it you need the whole picture. This is where traces come to help. Traces are similar to logs, except that a trace is a summary list of all events from all involved components. Often, to ensure the correct correlation of logs from different components, a global identifier is used that allows following the path of the request through all components, like a request ID or trace ID. And to collect all this beautifully into a nice graph, you can connect tools like OpenTelemetry plus Jaeger.
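Correlating logs across components can be sketched as below: attach one ID to outgoing requests and stamp every log line with it. The `x-request-id` header name is a common convention, not a standard — OpenTelemetry propagates context via the W3C `traceparent` header instead — and the helper names are illustrative.

```typescript
// Attach a correlation ID to outgoing request headers so the backend
// (owned by another team) can echo it into its own logs.
function withRequestId(
  headers: Record<string, string>,
  id: string,
): Record<string, string> {
  return { ...headers, "x-request-id": id };
}

// Stamp every log line with the same ID; grepping for it across
// components reconstructs the request's path — a poor man's trace.
function logWithId(id: string, msg: string): string {
  return `[req=${id}] ${msg}`;
}

const id = "req-42"; // in practice: crypto.randomUUID()
const headers = withRequestId({ accept: "application/json" }, id);
const line = logWithId(id, "fetching /api/cars");
```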
Next, another concept related to logs is profiling. Like logs, profiles are records of events within one component, only profiles give the most detailed information about what is happening under the hood. Most often, profiles are used to analyze bottlenecks along the path of your request: the part of the request that takes the longest to execute is the part that most likely can be optimized to speed up your entire request. For profiling we also use Sentry: we look at similar dashboards, see which endpoints and pages take the longest, and dive into the profiles for improvement. Okay, updating the passport. And the cherry on top is product analytics: how users use your application. That's it: you choose events in your application that you want to track, and based on them you build typical user scenarios for your application. Then, together with your product owner, you decide what this means and how to improve the user experience so that users decide to buy your product faster. Even though product analytics is as far as possible from the technical indicators of the application, in practice we managed to catch several cases of improperly working components with it. For example, the currency-switching event fired an extra time, and we realized the component was not working correctly. Or, for example, an event that should only fire on the listing page was for some reason sent on the booking page; we figured this out too and found a problem in the cache. And that's basically it. We add to the passport the SLA for availability, which we talked about earlier, test coverage, and Swagger, and my passport is ready. Now, having such a passport, I can quickly find the domain of my application for a quick checkup. I can track errors by name, see the number of events for a certain period, and receive detailed information:
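The listing-page-versus-booking-page bug above suggests a cheap guard: declare which pages each analytics event is allowed to fire on, and flag anything else. The event and page names here are illustrative, not the team's real taxonomy.

```typescript
// Allow-list of pages per analytics event; an event arriving from an
// unlisted page (like the talk's listing event on the booking page)
// signals a broken component rather than real user behavior.
const allowedPages: Record<string, string[]> = {
  listing_viewed: ["listing"],
  currency_switched: ["listing", "booking"],
};

function shouldTrack(event: string, page: string): boolean {
  return (allowedPages[event] ?? []).includes(page);
}

// A wrapper could drop (or report) disallowed events before sending:
function track(event: string, page: string): "sent" | "blocked" {
  return shouldTrack(event, page) ? "sent" : "blocked";
}
```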
the user data, browser, call stack, request and response, and so on. I can collect and analyze system logs for quick identification and resolution of problems. I can receive real-time indicators of performance and system status: response time, error rate, resource usage, and so on. I can see the number of potential vulnerabilities, assess the risk, and take proper action. I can see metrics collected directly from the container, such as the amount of memory used or the number of open connections. I can track which service characteristics, like availability or latency, have SLOs defined, and monitor their performance against those targets. I can monitor the error budget aligned with the SLO targets to balance availability with feature delivery. I can see the current percentage of code that is covered by unit tests. And I can see the current Swagger spec describing the API for other services. And that's all from my side. Thank you for your attention. I hope this observability passport will also help you build your applications. Bye.

Vadim Tsaregorodtsev

Lead Software Engineer @ ETG



