Resilience: Harnessing Chaos Engineering to Build Robust Systems

Abstract

Discover how chaos engineering revolutionizes system reliability by proactively testing failures in real-world conditions. Learn proven practices to build resilient, scalable systems, minimize downtime, and fortify your infrastructure against unexpected disruptions. Resilience starts with chaos!

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey, everyone. Today I'll be presenting on cloud native observability: enhancing monitoring and performance. When we think of cloud applications, the first things that come to mind are probably SaaS software and high-volume internet sites like YouTube and Netflix. For these services to be delivered efficiently to customers and users, a ton of engineers are continuously monitoring them behind the scenes. They're trying to identify potential bottlenecks and optimizations, and basically trying to see what the health of all these systems is, as the systems mostly run in a data center delivering these services.

So let's get into some details of what I'll be presenting today. We'll go over what cloud observability is, why it is important, some key observability tools and technologies, the benefits, and best practices. We'll go through some real-world use cases, our current challenges, and what lies ahead: the future of cloud observability.

What is cloud native observability? The definition is: observability is the ability to measure the internal state of a system based on its outputs. In cloud native environments, it involves logs, metrics, and traces. Those are the key components. Metrics are insights into how a system is performing. Logs are a constant stream of events describing what the system is running or doing. Traces capture the overall flow through the system: how it is part of the overall topology, where requests are coming in, and where they are going. In a nutshell, it is a way to gather logs, metrics, and traces from a system and use them for monitoring purposes.

So why is observability important in cloud native? Modern services are mostly microservices. As microservices keep evolving, they start getting complicated. Lots of APIs get embedded into them, lots of CRUD operations.
Some level of complexity kicks into these architectures, with all the interdependencies across different microservices that come from their distributed architecture. To get visibility into those kinds of architectures, observability is really important, so that we can pinpoint performance issues at the level of a specific microservice or module.

The next one is real-time monitoring. Overall, observability is key to identifying potential issues in the system. It helps us gain insight into how the system is actually performing and gives us a lot of visibility into the application, based on the metrics that have been configured for different kinds of applications. Better visibility gives us faster issue resolution. That's another reason why observability is so important.

Covering scalability and performance, it ensures smooth auto scaling and resource allocation. As we constantly monitor these services, servers, and applications, we have better insights and are better equipped to make decisions. If the system is constantly running at a high load, it helps us make the decision to auto scale and beef up the system. In case of low utilization, we can always ramp down. So it gives us a lot of visibility that eventually helps us design the system better on the scalability side of the architecture as well.

The same goes for improved debugging. With modern applications and services, there tend to be constant issues that we try to identify: performance issues, bugs, customer issues, latency-related issues. Improved monitoring helps engineers identify the issue, so it reduces the mean time to identify the issue as well as the mean time to resolve it.
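As a rough illustration of the two figures just mentioned, mean time to identify and mean time to resolve can be computed from incident timestamps. This is only a sketch: the incident records, the function names, and the choice to measure both durations from the moment the incident started are assumptions made for the example, not something stated in the talk.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, identified, resolved).
incidents = [
    (datetime(2025, 3, 1, 10, 0), datetime(2025, 3, 1, 10, 20), datetime(2025, 3, 1, 11, 0)),
    (datetime(2025, 3, 2, 14, 0), datetime(2025, 3, 2, 14, 10), datetime(2025, 3, 2, 14, 40)),
]


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtti_minutes(records):
    """Mean time to identify: from incident start to identification."""
    return mean_minutes([identified - started for started, identified, _ in records])


def mttr_minutes(records):
    """Mean time to resolve: from incident start to resolution (an assumed definition)."""
    return mean_minutes([resolved - started for started, _, resolved in records])


print(mtti_minutes(incidents), mttr_minutes(incidents))
```

Tracking both numbers over time is one way to check whether better visibility is actually shortening the path from detection to fix.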
So the turnaround from identifying an issue to delivering a fix for those bottlenecks back to the customer is much faster with cleaner, higher visibility into these services or applications.

Some of the key observability tools and technologies in use today: Prometheus is primarily used to scrape metrics from different systems and applications. It collects metrics and also helps monitor them. Grafana is another open source visualization tool that is getting really popular, with a lot of third-party plugin integrations. It primarily helps in creating visualizations based on the metrics stored in backends like Prometheus or InfluxDB. It lets us query those metrics continuously and put alerting in place to alert the user when there are anomalies or something exceeds a certain threshold.

One thing that's becoming popular is OpenTelemetry. Pre-OpenTelemetry, there were multiple vendors in the market. Everybody had their own API or their own agents that were constantly scraping or collecting metrics from the applications or systems, building a vendor-specific schema, and sending it to their own backends. Once an organization starts using a specific vendor, it is pretty much married to that vendor. Migrating to a different vendor is extremely hard; there is very limited vendor portability. So as an industry, everybody came up with standards, not only for how we collect these metrics but also for how we report them to the backends. The backend has a defined schema, which is what is included in OpenTelemetry.
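The portability argument above can be sketched in a few lines: application code records against one shared schema, and only the exporter changes when the backend changes. This is not the OpenTelemetry API. `MetricPoint`, `Exporter`, both exporter classes, and the metric and attribute names are hypothetical, chosen only to show the idea.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Protocol


@dataclass
class MetricPoint:
    """Hypothetical shared schema that every backend agrees to accept."""
    name: str
    value: float
    attributes: dict = field(default_factory=dict)


class Exporter(Protocol):
    def export(self, point: MetricPoint) -> str: ...


class JsonLineExporter:
    """Hypothetical backend A: one JSON object per data point."""
    def export(self, point: MetricPoint) -> str:
        return json.dumps(asdict(point), sort_keys=True)


class TextLineExporter:
    """Hypothetical backend B: one plain text line per data point."""
    def export(self, point: MetricPoint) -> str:
        labels = ",".join(f'{k}="{v}"' for k, v in sorted(point.attributes.items()))
        return f"{point.name}{{{labels}}} {point.value}"


def record_request_latency(exporter: Exporter, seconds: float) -> str:
    # The instrumentation code stays the same no matter which exporter is plugged in.
    point = MetricPoint("http_request_duration_seconds", seconds, {"service": "checkout"})
    return exporter.export(point)


print(record_request_latency(JsonLineExporter(), 0.25))
print(record_request_latency(TextLineExporter(), 0.25))
```

Swapping `JsonLineExporter` for `TextLineExporter` changes the wire format without touching `record_request_latency`, which is the kind of vendor portability a shared standard is meant to provide.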
OpenTelemetry also provides a standard for tracing across different applications, or different tiers of an application, to stitch together how a request flows from one tier to another or from one system to another, giving us a lot more insight into the overall end-to-end picture. Instrumentation is the process of instrumenting applications via agents, and OpenTelemetry is really popular for that.

Jaeger is, again, very popular with OpenTelemetry. It is one of the backends: when you use the agents, you can export the data to the Jaeger platform. Once different applications report into the Jaeger exporter, Jaeger has a really smart way to stitch together all of those different sources and present an overall end-to-end picture of how the request actually flows within the system, building out the dependencies across the system.

Next is the Elastic Stack. It's mostly for log aggregation and analytics. It helps in collecting the logs and providing analytics on top of them.

Regarding unified monitoring, centralized visibility across cloud native environments is very important. Proactive issue detection and being adaptable to dynamic workloads, especially on the scalability front, are very important too. Also, when we say cloud native, it has to be very DevOps friendly, because modern cloud native software is delivered iteratively: every two or three weeks, or maybe once a quarter. So there should be some kind of continuous integration and continuous deployment model.

The best practices are: use an open standard, which is OpenTelemetry. Use Grafana dashboards for visualizations. Have some automation in place to find anomalies based on the trend being observed in the metrics. Any spikes or anomalies should be quickly detected and reported back, to find potential bottlenecks or performance issues.
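One simple way to automate the spike detection described above is to compare each new sample against the mean and standard deviation of a trailing window. The sketch below is a plain rolling z-score check, not the method of any particular tool. The function name, the window size, the threshold, and the sample latencies are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_spikes(samples, window=5, threshold=3.0):
    """Return indexes of samples that sit more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against,
            # so this simple sketch skips the point.
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical request latencies in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
print(find_spikes(latency_ms))
```

Note that a spike inflates the trailing window for the next few samples, so a production setup would likely want a more robust baseline than this.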
Some real-world use cases: as I mentioned, Netflix uses OpenTelemetry for distributed tracing across microservices. All the microservice interdependencies are measured using OpenTelemetry to identify potential bottlenecks in handling the overall requests, and to see at which specific microservice there is a potential issue. Same with Uber: it implements Jaeger for end-to-end transaction monitoring. Airbnb uses Prometheus and Grafana for real-time performance monitoring.

Challenges. These systems keep getting bigger and bigger. The number of metrics available keeps increasing year over year, so huge volumes of telemetry data get exported and the cost goes up. Also, different cloud environments have different ways of presenting the data and different kinds of metrics. So there is a standard on the collection part, but there is no standard on the visualization front.

Future trends. AI-driven anomaly detection is a big thing right now. Most observability vendors are primarily focused on how to leverage AI: using AI as a pass-through for every metric, or having some kind of machine learning model constantly scrape data, analyze it, and report any anomalies. It is exciting to see how that's going to shape up. Similarly, there is more widespread adoption of eBPF for deep observability, as well as increased integration of observability with security tools. When it comes to security services, not much of it is observable, or there are a lot of restrictions. So how that turns out in the future, especially for security-related tools, is going to be something to watch out for.

So that's pretty much it. Thank you so much.

Conf42 Cloud Native 2025 - Online

March 06 2025 - premiere 5PM GMT

Sai Prakash Narasingu

Senior Staff Engineer - Cloud Observability @ ServiceNow
