Conf42 Observability 2025 - Online

- premiere 5PM GMT

Function vs. Friction: Designing the Right Observability Stack in AWS Cloud

Abstract

Great observability doesn’t shout—it quietly enables clarity. This talk explores how to evaluate, select, and integrate AWS observability tools so your stack disappears into the background and data-driven decisions take the lead.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Prudhvi, a Data Solution Architect with Quantiphi. I have over 10 years of experience designing cloud-native analytics platforms for many different customers, specifically on AWS. Working with customers across domains, I've repeatedly seen a common pattern: services are often misapplied. When it comes to observability in particular, this leads to bad outcomes: blind spots, alert fatigue, ballooning costs, and misconfiguration. So today at Conf42 we will keep it at a high level and focus mainly on real-world monitoring and observability use cases. In today's cloud-first world, managed services promise agility, accountability, and reduced operational overhead, and these services have become the foundation of modern digital and data architectures. But the ease of adoption often masks a critical risk: misalignment between a service's capabilities and the actual workload requirements. When cloud services are selected without a clear understanding of their intended function in context, they can introduce long-term friction and cost teams valuable time, capacity, and financial resources. By the end of the session, we will have a clear mapping of requirements to tool choices, so that we never over- or under-provision our telemetry stack, plus a framework that helps us weigh cost, complexity, and granularity, along with best practices and common pitfalls to watch out for as we build and optimize our observability layer in AWS. So what is observability? Observability is the measure of the internal state of a system, obtained by examining its outputs such as logs, metrics, and traces. It's not just monitoring; it's about deep insight into and understanding of the system. Its main purpose is to empower teams to detect, diagnose, and resolve issues faster, which helps improve reliability, performance, and user satisfaction.
When it comes to cloud and data platforms, we have a lot of microservices, serverless functions, and distributed pipelines, and observability is the connective tissue that holds modern data architectures together. Without it, blind spots emerge, leading to increasing downtime, user frustration, and loss of trust. From a strategic standpoint, observability is the foundation of a cloud analytics platform: it enables real-time decision making, compliance assurance, and proactive optimization. Here are the core pillars of observability: the first is logs, the second is metrics, and the third is traces. Let's understand each of these. Logs are immutable records of events that have occurred within the system, such as error messages, user actions, and state changes. Logs provide granular, timestamped context and are vital for post-incident analysis and auditing. Metrics are quantitative measurements taken at intervals. For example, CPU usage: "80% utilized" is a quantitative metric. Memory usage, request counts, error rates: these are all metrics, and they are compact and efficient for trend analysis. How is CPU utilization over time? How is performance over time? That's where metrics are really helpful. The last one is traces. Traces show the journey of a single request across various services, so they capture latency. This is critical for understanding complex, interconnected systems, and it definitely helps with root cause analysis. Now let's understand some of the core services in AWS. The very first one is CloudWatch. Capability-wise, it is used for log collection, metric aggregation, customizable dashboards, and alarms with automated actions. The general use case is general-purpose monitoring, alerting, and automated responses.
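The three pillars can be illustrated in a few lines of Python. This is a minimal, tool-agnostic sketch, not any AWS API: the function name, metric shape, and JSON fields are all illustrative. One request emits a structured log line, a compact metric sample, and a trace ID that correlates the request across services.

```python
import json
import time
import uuid

def handle_request(route, duration_ms, error=False, trace_id=None):
    """Emit all three telemetry signals for a single request (illustrative only)."""
    # Trace: an ID that follows the request across every service it touches
    trace_id = trace_id or uuid.uuid4().hex

    # Log: an immutable, timestamped record of what happened
    log_line = json.dumps({
        "timestamp": time.time(),
        "level": "ERROR" if error else "INFO",
        "route": route,
        "trace_id": trace_id,
        "message": "request failed" if error else "request completed",
    })

    # Metric: a compact quantitative sample, cheap to aggregate for trend analysis
    metric = {"name": "RequestLatency", "unit": "Milliseconds", "value": duration_ms}

    return log_line, metric, trace_id
```

The point of the sketch is the separation of concerns: the log carries context for post-incident analysis, the metric is what you chart over time, and the trace ID is what stitches this request together across services.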
CloudWatch is the foundational observability tool provided by AWS. It allows us to gather logs and metrics and define dashboards, and we also have the ability to configure custom alarms, proactive alerts, and automated remediation. It's versatile and integrates seamlessly across various AWS services, making it ideal for monitoring and basic troubleshooting. When it comes to X-Ray, it provides end-to-end tracing of requests through a distributed application and detailed insight into microservice interactions. It is mainly used for debugging and performance optimization of microservices, so we can understand which microservice is taking more time versus the ones responding quickly. It's a very helpful service that AWS offers. The next one is Amazon Managed Grafana. Capability-wise, it is very rich in visualization. It provides seamless integration with multiple AWS services as well as multiple open source databases, and it offers real-time dashboards and analytics. Managed Grafana is mainly used for real-time visualization of system metrics and logs for operational insight. The next one is a BI offering called QuickSight. It's a very popular tool. It's a BI service, not really an observability tool, but it provides interactive dashboards, insightful analytics, and natural language query processing using QuickSight Q. Use-case-wise, it's for business intelligence and analytics reporting with periodic data refresh; it's not real time, though. And the last one is the AWS Distro for OpenTelemetry. Its capabilities include standardized instrumentation and data collection across various platforms. It's vendor neutral with broad compatibility, so it provides unified observability across our diverse application stacks. Here is another list of additional AWS observability tools. The first one is CloudTrail.
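As a concrete example of feeding CloudWatch, here is a sketch of publishing a custom metric through the boto3 `put_metric_data` API. The namespace, metric name, and dimension are hypothetical; the payload builder is separated from the API call so the shape is easy to inspect, and the actual call (which needs AWS credentials) is shown only in a comment.

```python
def latency_metric(service, millis):
    """Build a CloudWatch put_metric_data payload for one latency sample."""
    return {
        "Namespace": "MyApp/Observability",  # hypothetical custom namespace
        "MetricData": [{
            "MetricName": "RequestLatency",
            "Dimensions": [{"Name": "Service", "Value": service}],
            "Unit": "Milliseconds",
            "Value": millis,
        }],
    }

# To publish (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**latency_metric("checkout-api", 182.0))
```

Once the metric is flowing, the same namespace and dimensions are what you reference in dashboards and alarms.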
CloudTrail audits every API action in an AWS account. It's essential for governance, compliance, and tracking changes. It logs any unauthorized attempts to access something, and it keeps a record of services interacting with each other and of users trying to access certain services; it keeps track of all those actions. AWS Config, on the other hand, continuously monitors resource configurations and ensures adherence to regulatory standards. DevOps Guru uses ML to detect anomalies in the application, suggesting root causes and remediations; it's a very helpful tool. Trusted Advisor evaluates our AWS setup against best practices for cost, security, and performance and provides recommendations. Compute Optimizer recommends the most efficient compute resources based on our usage patterns. And finally, Inspector automates security scans on EC2, Lambda, and container workloads. Now let's understand some architecture patterns. Centralized telemetry pipelines: we collect, store, and process observability data from various sources in a centralized manner using services like S3 and OpenSearch. For real-time monitoring, say we want to enable live dashboards and alerts: tools like Grafana and CloudWatch give immediate visibility into system performance. In this scenario, Grafana and CloudWatch stand out, and along with them we can easily integrate other services to send notifications, like SNS and SES. For AI/ML integration, it's recommended to leverage Amazon SageMaker and Bedrock to apply machine learning models for predictive analysis, automated diagnostics, and anomaly detection. And for automated response, implement automated remediation and scaling using AWS Lambda and EventBridge; it helps us quickly address issues and maintain reliability.
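The automated-response pattern above (EventBridge routing a CloudWatch alarm state change to Lambda) can be sketched as a plain Lambda handler. The event fields below follow the shape of the "CloudWatch Alarm State Change" EventBridge event as I understand it; the remediation step itself is left as a stub, since the right action (restart a task, scale a service, notify SNS) depends on the workload.

```python
def lambda_handler(event, context):
    """EventBridge -> Lambda remediation sketch for CloudWatch alarm state changes."""
    detail = event.get("detail", {})
    alarm = detail.get("alarmName", "unknown")
    state = detail.get("state", {}).get("value")

    if state != "ALARM":
        # OK / INSUFFICIENT_DATA transitions need no remediation
        return {"alarm": alarm, "action": "none"}

    # Workload-specific remediation goes here (stubbed): e.g. restart a task,
    # trigger scaling, or publish to an SNS topic for the on-call team.
    return {"alarm": alarm, "action": "remediate"}
```

An EventBridge rule matching `source = aws.cloudwatch` and the alarm-state-change detail type would invoke this function automatically whenever an alarm fires.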
As mentioned, a robust observability architecture centralizes data, powers real-time monitoring, and leverages AI to automate responses for resilience and scalability. Now let's understand what is function and what is friction. It's a framework we should keep in mind when selecting managed cloud services, and it's very important to follow this approach. Function is when the tools we pick accelerate insight and integrate seamlessly at scale: observability tools empower teams by providing clear, fast, and aligned insights. Friction, on the other hand, is when a tool introduces noise, lag, and complexity. It hinders the team's ability to respond and causes delays, confusion, and ultimately higher cost. Looking through the function-versus-friction lens helps teams choose observability tools that drive operational excellence by maximizing insight and minimizing complexity. Now let's look at examples of function. The very first one is using CloudWatch alarms for real-time alerts. It's a very good function, and it's a recommended best practice to use CloudWatch to collect metrics and logs and visualize the data using CloudWatch dashboards. We can also trigger alarms for EC2 monitoring, Lambda metrics, Container Insights, and cost tracking. CloudWatch definitely acts as a function here, and whenever we have such a use case, it's recommended to go with CloudWatch. The next one is X-Ray for pinpointing service bottlenecks. Let's say we have a web of microservices and we don't know what is taking more time. In such scenarios, X-Ray provides distributed tracing for microservices and helps us debug performance issues. It traces API Gateway and Lambda flows and visualizes service maps. It's a great tool and helps identify bottlenecks very easily.
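To make the CloudWatch-alarms example concrete, here is a sketch of the parameters for `put_metric_alarm`: alarm when average EC2 CPU stays above 80% for three consecutive 5-minute periods. The SNS topic ARN and instance ID are hypothetical placeholders.

```python
def cpu_alarm(instance_id):
    """Parameters for CloudWatch put_metric_alarm: sustained high CPU on one instance."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # each datapoint is a 5-minute average
        "EvaluationPeriods": 3,     # must breach three periods in a row
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Hypothetical SNS topic that notifies the on-call team:
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    }

# To create it (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm("i-0abc123def456"))
```

Requiring three consecutive breached periods is a simple way to avoid paging on a momentary CPU spike; tighten or loosen `EvaluationPeriods` to trade detection speed against noise.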
Another example of function, and my favorite one, is a Grafana dashboard for live performance views, not just for observability but for monitoring IoT and many other use cases. Managed Grafana enables live dashboards from CloudWatch, Prometheus, and several other data sources; you name it. It's used for things like service level objective tracking and anomaly detection. And OpenTelemetry for unified instrumentation: it provides standardized instrumentation across applications on ECS, EKS, or hybrid environments, sending data to CloudWatch or third-party tools. Overall, these tools reduce time to detect and time to resolve and provide actionable, high-value insights that drive operational excellence. Now let's see some examples of friction. Let's say you use QuickSight for real-time monitoring. QuickSight is not a real-time tool, so there is definitely going to be a lag; its batch-processing approach makes it unsuitable for time-sensitive monitoring. Another is scaling CloudWatch Logs without structured queries: querying challenges due to unstructured log data lead to reduced visibility and insight. And overlapping tools without integration: tool sprawl causes fragmented data and complexity, and it hinders collaboration and efficiency. Friction arises when tools are misapplied, architectures fragment, or signals overwhelm teams. It leads to delays, confusion, and higher cost. So what is the role of GenAI in observability? Today there is a lot of hype about AI, and it definitely has a role in observability. One is predictive analytics to forecast issues; that's more of a machine learning use: we use machine learning to predict problems and system failures before they occur and enable proactive remediation. Another is automated diagnostics, where tools suggest root causes.
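The "unstructured logs" friction is easy to avoid at the source: emit JSON so CloudWatch Logs Insights can filter on fields instead of pattern-matching free text. A minimal formatter sketch using Python's standard `logging` module; the field names and service name are conventions I chose, not an AWS requirement.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line, queryable by field."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Hypothetical service logger wired to emit structured lines on stderr
logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In CloudWatch Logs Insights you could then query fields directly, e.g.:
#   fields @timestamp, level, message | filter level = "ERROR"
```

Because every line is a self-describing JSON object, the same logs stay queryable as volume grows, which is exactly where free-text logs start to hurt.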
We can use AI-powered analysis of logs, metrics, and traces to quickly identify the root causes of issues, which helps accelerate troubleshooting and resolution. We can also use natural language processing to simplify insights: allow users to ask questions about system performance and behavior in plain language, which helps generate tailored visualizations and reports. And then there is smart alerting to reduce noise: we use machine learning to intelligently filter and prioritize alerts, reducing alert fatigue and ensuring teams focus on the most critical issues. So GenAI and machine learning, when it comes to observability, add foresight, accelerate diagnostics, and enable simpler, more intuitive access to insights. Earlier, when we discussed function versus friction, we understood how choosing the wrong tool can impact overall objectives. Here we'll discuss some of the common pitfalls and how to mitigate them. Siloed tools: observability tools that are not integrated lead to fragmented data and reduced visibility. Alert fatigue: an overwhelming number of alerts distracts and desensitizes teams and hinders effective response. Reactive troubleshooting: relying on manual, time-consuming processes to diagnose and resolve issues rather than proactive prevention. Addressing these common pitfalls up front strengthens operations without overwhelming teams, through centralized pipelines, automated alerts, and design-time observability integration. Coming to best practices: it's important to implement observability early. Always incorporate observability capabilities into your architecture and design process from the start, rather than bolting them on later. Align SLOs and SLIs, service level objectives and service level indicators, to business goals: ensure observability metrics and targets are closely tied to the KPIs that drive the organization's success. It's very important.
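Smart alerting doesn't have to start with ML; even simple deduplication cuts a lot of noise. Here is a toy suppression filter (purely illustrative, not an AWS API): repeats of the same alert key within a cooldown window are dropped, so the team sees one page per incident rather than one per evaluation cycle.

```python
import time

class AlertDeduper:
    """Suppress repeats of the same alert within a cooldown window (seconds)."""

    def __init__(self, cooldown=300.0):
        self.cooldown = cooldown
        self._last_emitted = {}  # alert key -> timestamp of last emitted alert

    def should_emit(self, key, now=None):
        """Return True if this alert should reach a human, False if suppressed."""
        now = time.time() if now is None else now
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # duplicate inside the cooldown window: suppress
        self._last_emitted[key] = now
        return True
```

A real smart-alerting layer would add severity-based prioritization and learned baselines on top, but the principle is the same: fewer, higher-signal notifications.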
Also, leverage automation and machine learning wherever possible: automating observability tasks like alert generation and root cause analysis helps improve efficiency and accuracy. And last but not least, focus on continuously reviewing and optimizing the pipelines: regularly assess the pipelines, tools, and processes to identify areas for improvement and ensure they remain relevant. Observability thrives when it is baked into the architecture, aligned to business value, and consistently refined for efficiency and relevance. In this slide, we'll talk about real-world use cases. Many use cases are applicable; I'm just discussing a few of them. In manufacturing IoT, observability helps monitor sensor data, predict maintenance needs, and optimize uptime. For financial platforms, it traces transactions and detects fraud. In e-commerce, it tracks conversions, monitors customer journeys, A/B tests performance, and many other things. These real-world examples show how observability drives operational, financial, and customer outcomes, from the factory floor to the online storefront. In summary, observability enables scalable, resilient, and reliable systems. It's crucial for modern data and cloud-native systems, and it enables teams to maintain reliability, performance, and trust. AWS offers a powerful suite of observability tools like CloudWatch, X-Ray, and Grafana that provide diverse perspectives on system health. And from a best-practice standpoint, we can definitely leverage GenAI to strategic advantage: by following observability best practices and leveraging the power of AI and machine learning, organizations can gain strategic advantage through real-time insights, predictive analytics, and automated response. So it is very important for modern cloud-native systems, and AWS provides a comprehensive observability toolset, especially when combined with best practices and the latest AI capabilities.
Finally, as we conclude: observability in AWS, and not just AWS but any cloud offering, is a crucial capability for modern data architectures and solutions. It enables teams to maintain reliability, performance, and trust. Using the approaches, recommendations, and best practices we discussed, we can build scalable, resilient, and reliable systems that drive strategic advantage through real-time insights, predictive analytics, and automated response. Thank you.

Prudhvi Raj Atluri

Associate Architect - Data @ Quantiphi



