Conf42 Observability 2025 - Online

- premiere 5PM GMT

Navigating the Digital Landscape with Intelligent Full Stack Observability

Abstract

Decision-making is shaped by data, and real-time data facilitates prompt decision-making. Simply monitoring and reacting to failures falls short. Hence, a strong strategy for full-stack observability, coupled with AIOps, is increasingly crucial to proactively prevent recurring failures.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning, good afternoon, everyone. This is Manik Kashikar. I'm a principal architect at Thoughtworks India. I'm really excited to be here to walk you through how organizations can navigate today's complex digital landscape using intelligent full stack observability. As systems scale and expectations rise, observability is no longer just a DevOps concern; it is a business imperative.
Let's understand the cost of poor observability. We often hear that visibility is everything, but in reality, most enterprises see only a fraction of their landscape despite having multiple tools. As per the State of Observability report, 89% of organizations admit they still lack complete visibility into their systems. Tool sprawl, long MTTR, and exploding telemetry volumes are all driving operational inefficiencies and rising costs. The question is no longer whether to invest in observability, but how to do it smartly.
Hence, I request everyone here to understand why it is important to have your systems observed each and every time. Please treat observability as a first-class citizen. This isn't just about collecting logs or metrics. It is about understanding the why behind your system's behavior in real time and, more importantly, predicting what it could evolve into. This is a shift from monitoring to insight-driven decision-making.
Think of observability as a layered strategy. It spans applications, infrastructure, business, and even user experience domains. The MELT stack (metrics, events, logs, and traces) acts as the foundation for building a strong and robust observability framework. The real impact happens when you correlate these pillars across domains in real time.
Let's also understand observability maturity. The slide you see here reflects a critical shift in how organizations evolve their observability approach. Early maturity stages often start with siloed monitoring.
Teams track metrics and logs, but the insights are isolated and lack context as well as domain information. As maturity progresses, we see a convergence of telemetry with real business KPIs: evaluating, tracking, and monitoring uptime, revenue impact, customer churn rate, et cetera. Full stack observability represents the pinnacle: a model where infrastructure, applications, digital experience, and business context are all stitched together. This maturity enables not just technical diagnostics but also insight into the business impact of technical degradation. For example, a 40-millisecond increase in response time could cause a 2% drop in conversion. That is a concerning point for your business.
Let's look at this journey as a case in point. This is a real-world example of a customer journey: a customer on an e-commerce platform who wants to check out some products. It also emphasizes the importance of having dual lenses, the outside-in perspective and the inside-out perspective. The outside-in perspective captures how the user experiences the application: slow pages, failed transactions, high latency. The inside-out perspective surfaces system-side telemetry: backend and service response times, database query lag, database query performance, memory spikes, or API gateway bottlenecks.
True observability connects these two views in real time. For instance, a spike in checkout issues may be directly tied to a database lock timeout or a failing third-party payment API. Without that correlation, you are left guessing, and it will take a lot of time to identify the root cause. With it, you enable precise recovery and business-aligned decision-making.
AWS provides multiple services that can help you achieve full stack observability.
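To make the link between the outside-in and inside-out views concrete, here is a minimal Python sketch. It is an illustration only, not something shown in the talk: the record shapes (`FrontendSample`, `BackendEvent`), the service names, the latency threshold, and the 30-second window are all assumptions. It pairs each slow checkout seen by a user with the backend events recorded shortly before it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FrontendSample:
    """Outside-in signal: checkout latency as a user saw it (hypothetical shape)."""
    timestamp: datetime
    checkout_latency_ms: float


@dataclass
class BackendEvent:
    """Inside-out signal: an event emitted by a backend component (hypothetical shape)."""
    timestamp: datetime
    source: str
    message: str


def correlate(samples, events, latency_threshold_ms=2000.0,
              window=timedelta(seconds=30)):
    """Pair each slow checkout sample with backend events seen shortly before it."""
    findings = []
    for sample in samples:
        if sample.checkout_latency_ms < latency_threshold_ms:
            continue  # user experience was acceptable, nothing to explain
        suspects = [
            e for e in events
            if sample.timestamp - window <= e.timestamp <= sample.timestamp
        ]
        findings.append((sample, suspects))
    return findings


# Illustrative data only; the names and values are made up.
t0 = datetime(2025, 1, 1, 12, 0, 0)
samples = [
    FrontendSample(t0, 450.0),
    FrontendSample(t0 + timedelta(minutes=1), 5200.0),
]
events = [
    BackendEvent(t0 + timedelta(seconds=50), "orders-db", "lock wait timeout"),
    BackendEvent(t0 + timedelta(minutes=5), "payment-api", "HTTP 503"),
]

for sample, suspects in correlate(samples, events):
    print(sample.timestamp.isoformat(), sample.checkout_latency_ms, "ms")
    for e in suspects:
        print("  possible cause:", e.source, "-", e.message)
```

A time window is the simplest possible join key. In a real system, a shared trace or request ID would usually give a more reliable link between the two views than timestamps alone.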
AWS provides a robust observability toolkit whose value lies in how you stitch its services together. CloudWatch helps you capture native metrics and logs across various compute systems. X-Ray adds distributed tracing, which is essential for following a request across microservices. CloudWatch Synthetics mimics user journeys to detect issues before customers do. CloudWatch RUM (real user monitoring) provides frontend visibility, including user latency and Core Web Vitals. The opportunity here is integration: building composite dashboards that unify telemetry across layers. For example, correlating CloudWatch alarms with X-Ray traces will help you isolate bottlenecks automatically. This helps you transform raw signals into actionable insights and eventually reduce MTTR significantly.
Let's also quickly understand how you can work toward AIOps and what the journey looks like. With observability at scale comes the data deluge: terabytes of logs, thousands of metrics, and alert fatigue. This is where AIOps steps in. AIOps is a natural evolution: applying machine learning and automation to operational data to detect patterns, surface anomalies, and even auto-remediate issues. Think of it as observability with intelligence. We are seeing a shift from reactive alerts to predictive signals, spotting anomalies before users are impacted. This journey to AIOps doesn't happen overnight. It starts with a robust observability data layer, continues with smart correlation, and gradually introduces automated decision-making.
Now, let's understand which components an organization should focus on to achieve this. AIOps operates across three strategic pillars. The first is observe: ingest clean, structured telemetry across MELT. The second is learn: apply ML services and AI models to detect outliers, understand seasonality, and predict failures. The third is act: automate downstream workflows such as alert suppression, ticketing, dynamic scaling, and even rollbacks.
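The three pillars can be sketched in a few lines of Python. This is a deliberately simple illustration under stated assumptions, not how any AWS service works internally: the "learn" step is a plain rolling z-score (it does not model seasonality), the CPU series is made up, and `act` is a hypothetical stand-in for a ticketing or scaling hook.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, z_threshold=3.0):
    """Learn: flag points that deviate strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue  # flat baseline, z-score is undefined
        z = (values[i] - mu) / sigma
        if abs(z) >= z_threshold:
            anomalies.append((i, values[i], round(z, 1)))
    return anomalies


def act(anomalies):
    """Act: map each anomaly to a (hypothetical) downstream action."""
    return [f"open ticket: sample {i} value {v}" for i, v, _ in anomalies]


# Observe: a made-up CPU utilisation series (percent), one sample per minute.
cpu = [41, 43, 40, 42, 44, 42, 41, 95, 43, 42]

anomalies = detect_anomalies(cpu)
for line in act(anomalies):
    print(line)
```

Only the sample at index 7 (the jump to 95) is flagged here. The samples right after it are not, because the spike itself inflates the trailing baseline's standard deviation. That is one reason production detectors tend to use more robust baselines than a raw rolling mean.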
This is where the value multiplies. For example, if CPU usage spikes are consistently followed by database timeouts, AIOps will learn that relationship and suppress the noise while highlighting the root cause. This is the transition from alerts to intelligent signal management.
Now let's look at the AIOps tools and capabilities that AWS provides. The key services include DevOps Guru, a fully managed, ML-powered insight engine that detects operational issues well before they impact users. Next is CloudWatch anomaly detection. This is a beautiful service that automatically detects anomalies in time-series metrics. The third, and a very important one, is X-Ray Insights, which correlates traces with error rates and highlights slow segments. Each of these tools is valuable individually, but when combined they form a very powerful AIOps pipeline. For instance, DevOps Guru can detect a performance anomaly, link it to the failing X-Ray segment, and trigger an SNS alert that updates your incident management system automatically.
Now let's understand how you can optimize your IT operations using automation. With AIOps, we go beyond dashboards and alerts. We enter a world of intelligent alert routing, noise reduction, contextual correlation, and automated remediation. This turns operations from reactive firefighting into proactive governance.
Now, let's see how we can streamline operations. This is a journey that can be achieved in phases. To optimize operations, you should definitely leverage the AWS AI stack or something similar that other cloud vendors provide.
Now let's look at another case: how to reduce incident volumes as part of the day-to-day BAU operations and maintenance support activities in your organization. Here is a step-by-step playbook for achieving this. Instrument everything.
Ingest with structure, identify patterns, drive predictive alerts, and connect them to your ITSM systems. The goal is fewer tickets, faster resolution, and a measurable uplift in customer and business experience.
Here is a case in point: an anomaly detection framework. This isn't just about catching failures; it is about preventing them. For that, you need to train your ML models on relevant logs, metrics, and traces. Your system should be able to correlate the data provided to it and draw meaningful, contextual inferences from it. This will eventually help you reduce many of the issues that typically surface in operations.
Once you apply this, you reach your value realization phase. By instrumenting telemetry across systems, detecting patterns via ML, and integrating them into the workflow, incident volumes will surely drop significantly. You are no longer drowning in alerts. Instead, you are seeing fewer tickets, faster root cause identification, and measurable business impact. Hence, observability and AIOps together enable a continuous improvement loop for your organization, where every incident becomes training data for smarter operations.
While we have covered a lot of good things about AIOps and the various ways to achieve it, AIOps also comes with certain challenges. Some of the common ones are data quality, integration complexity, and resistance to change. However, the ROI you can achieve by implementing AIOps is compelling: improved efficiency, faster MTTR, and better customer satisfaction. When AIOps is done right, it will help you transform not just IT but your entire business.
Some closing thoughts: intelligent observability is not a luxury. It's a foundation for a high-performing digital enterprise.
From real-time user experience to deep backend visibility, from anomaly detection to AIOps-driven automation, the journey starts with visibility but ends with intelligence. Let's build systems that not only recover fast but adapt, predict, and self-heal. Thank you for joining me today for this session. I hope you found the session insightful. Happy learning to all the audience.
...

Manik Kashikar

Google Cloud Practice Lead @ Thoughtworks
