Conf42 Observability 2025 - Online

- premiere 5PM GMT

AI-Driven Observability for Massive AWS Serverless Workflows


Abstract

When a Step Functions fan-out unleashes thousands of Lambdas, cost spikes, latency cliffs, and tracing gaps follow. By streaming telemetry into a SageMaker-hosted graph neural network, you can pinpoint hot branches in real time and let AI auto-tune chunk sizes to keep SLOs and budgets safe.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Greetings, everyone. Welcome to Conf42 Observability 2025. My name is Vamsi Praveen Karanam, a senior software engineer who focuses on cloud-native automation and large-scale data processing pipelines. Our topic is Adaptive Parallelism Insights, which I'll abbreviate as APIX after this introduction. APIX combines OpenTelemetry with a compact graph neural network so that an AWS Step Functions workflow can learn to tune itself while it is running. By the end of this session, you will understand three things. First, how we capture branch-level data without rewriting business code. Second, how a small model converts that data into useful guidance. And third, how a closed loop safely applies the guidance to reduce latency and cost while keeping every decision auditable.

To begin, serverless fan-outs offer near-instant elasticity, but three problems surface as scale grows. First, latency cliffs arise because a few slow branches delay the entire workflow; even a single hotspot can break the service level agreement. Second, tracing gaps appear because high-level dashboards average away branch-specific anomalies, which means engineers cannot pinpoint trouble quickly. Third, cost spikes emerge when hidden retries and cold starts inflate the monthly bill without obvious warning.

These pains led us to frame three guiding questions. Question one asks how to collect fine-grained telemetry from every branch while adding minimal overhead. Question two considers whether a lightweight artificial intelligence model can digest that stream quickly enough to make timely recommendations. Question three asks how to apply those recommendations during a live run without breaking concurrency limits or financial guardrails. The rest of the talk answers each question in turn.

Many enterprise workloads, such as data enrichment, ETL retries, and compliance redaction, depend on the Map state pattern inside AWS Step Functions to fan out tasks. Limitations acknowledged: the native execution history application programming interface shows only coarse events, which hides the branch behavior that matters. Standardization helps, because OpenTelemetry, the leading vendor-neutral format for traces, metrics, and logs, lets us instrument once and analyze anywhere. In parallel, CloudWatch EMF, the Embedded Metric Format, adds structured counters inside Lambda logs without agents. Research to date often targets virtual machine clusters with reinforcement learning; less focus has landed on dynamic state machine graphs, where topology changes every minute. Opportunity identified: graph neural networks accelerate reasoning over graphs whose nodes and edges evolve rapidly, so they fit our fan-out scenario. APIX therefore unites detailed telemetry, open standards, and a compact graph neural network into one practical closed-loop control system.

The picture at right shows the flow of data and control. Starting at the top, the instrumentation layer adds an OpenTelemetry span to every Task, Parallel, and Map branch; spans carry a chunk ID, and the same wrapper logs counters such as payload bytes, retry count, and cold start time by using CloudWatch Embedded Metric Format. Next, the telemetry pipeline sends spans over the OpenTelemetry Protocol to a collector running on AWS Fargate. The collector passes real-time streams into Kinesis Data Firehose, which fuels live dashboards in Amazon Managed Grafana, and it also stores raw records as Parquet files in Amazon Simple Storage Service for deep analysis.
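To make the instrumentation layer concrete, here is a minimal sketch of what such a wrapper could look like, assuming a Python Lambda. It opens one OpenTelemetry span per branch, attaches a chunk identifier attribute, and prints Embedded Metric Format counters for payload bytes, retries, and cold starts. The attribute and metric names, the namespace, and the process_records stand-in are assumptions, and the exporter configuration that would ship spans to the Fargate collector is omitted.

```python
# Illustrative sketch only: a Lambda wrapper that adds one OpenTelemetry span per
# Map-state branch and emits CloudWatch Embedded Metric Format (EMF) counters.
# Attribute names, the metric namespace, and process_records are assumptions.
import json
import time

from opentelemetry import trace

tracer = trace.get_tracer("apix.instrumentation")
_COLD_START = True  # module-level flag: True only for the first invocation in a sandbox


def process_records(records):
    """Stand-in for the unchanged business logic of one branch."""
    return {"processed": len(records)}


def emit_emf(chunk_id, payload_bytes, retry_count, cold_start):
    """Print one EMF record to stdout; CloudWatch turns it into metrics without an agent."""
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "APIX/Branches",  # assumed namespace
                "Dimensions": [["ChunkId"]],
                "Metrics": [
                    {"Name": "PayloadBytes", "Unit": "Bytes"},
                    {"Name": "RetryCount", "Unit": "Count"},
                    {"Name": "ColdStart", "Unit": "Count"},
                ],
            }],
        },
        "ChunkId": chunk_id,
        "PayloadBytes": payload_bytes,
        "RetryCount": retry_count,
        "ColdStart": int(cold_start),
    }))


def handler(event, context):
    """Entry point for one fan-out branch; the business code itself stays untouched."""
    global _COLD_START
    cold_start, _COLD_START = _COLD_START, False
    chunk_id = event.get("chunk_id", "unknown")
    records = event.get("records", [])

    with tracer.start_as_current_span("process_chunk") as span:
        span.set_attribute("apix.chunk_id", chunk_id)  # ties the span to its branch
        result = process_records(records)
        emit_emf(chunk_id,
                 payload_bytes=len(json.dumps(records)),
                 retry_count=event.get("retry_count", 0),
                 cold_start=cold_start)
        return result
```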
Moving downward, SageMaker Serverless hosts a two-layer graph neural network with about 1.2 million parameters. Every 60 seconds the model consumes the latest window of branch metrics and predicts an ideal chunk size, plus a confidence score and the projected cost and latency impact. EventBridge then delivers that inference to a Lambda callback. The callback applies guardrail checks and, if they pass, patches the running Map state through the Step Functions update application programming interface. Finally, governance is preserved because each inference is hashed, stored in Systems Manager Parameter Store, and recorded in subsequent spans, giving auditors a full lineage from decision to outcome.

Automation must be safe, so the controller enforces three boundaries before any change. First, the chunk change window limits any adjustment to 25% within five minutes, which prevents oscillation. Second, the concurrency cap ensures the new chunk size keeps total Lambda concurrency below the account quota minus a safety buffer retrieved from Parameter Store. Third, the rollback trigger monitors 95th-percentile latency and total errors; if both rise for two consecutive windows, the controller reverts to the prior chunk size. Together these safeguards let the model explore improvements while guaranteeing that production stability and cost limits remain intact.

We measure first because insight depends on data quality. Span emission adds about two milliseconds per invocation, negligible overhead for data processing tasks that often run for seconds. After capture, the OpenTelemetry collector batches spans, pushes them to Kinesis Data Firehose, and writes full-fidelity records to Amazon S3. Visualization follows, where Managed Grafana converts those streams into live heat maps highlighting slow or throttled branches within seconds. Flexibility matters: because every interface follows open standards, teams can replace Grafana with another dashboard or route the Parquet files in S3 into any analytical platform without revisiting application code. Experience shows that this separation accelerates debugging and gives finance teams precise cost attribution tied to trace identifiers.

Evaluation used three workloads. Synthetic Uniform contains 10 terabytes of uniformly sized one-megabyte records. Synthetic Skewed also spans 10 terabytes but injects 5% larger records, up to 10 megabytes, to force slower processing. Live Enterprise entity data, for example contractual data, represents three terabytes of anonymized production updates. Baselines compared were static batching at 1,000 items, dynamic sub-tree partitioning that splits branches at about two gigabytes, and APIX adaptive chunking. Metrics tracked include 50th-, 90th-, and 95th-percentile latency, job cost in United States dollars, and throttle error counts.

Reading the table, adaptive chunking shows the lowest latency across all percentiles and the lowest cost for every workload. For Synthetic Skewed, throttle errors drop from 47 under static batching to five under APIX, demonstrating resilience against uneven record sizes. Live Enterprise data follows the same trend, with adaptive chunking cutting both tail latency and cost while reducing throttles from 24 to three. In summary, real-time tuning delivers consistent performance gains and tangible cost savings.
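The guardrail step can be pictured with a short sketch of the callback's checks, again under stated assumptions: the Parameter Store keys, the event shape delivered by EventBridge, and the render_definition helper are hypothetical, and the way the approved chunk size is written back is simplified to a single update_state_machine call rather than the exact APIX mechanism.

```python
# Hedged sketch of the three guardrail boundaries described above. SSM parameter
# names, the event shape, and render_definition are hypothetical placeholders.
import json

import boto3

ssm = boto3.client("ssm")
sfn = boto3.client("stepfunctions")

MAX_CHANGE_RATIO = 0.25   # boundary 1: at most a 25% adjustment per five-minute window
BAD_WINDOWS = 2           # boundary 3: roll back after two consecutive degraded windows


def within_change_window(current, proposed):
    """Boundary 1: bound the relative change in chunk size to prevent oscillation."""
    return abs(proposed - current) <= MAX_CHANGE_RATIO * current


def within_concurrency_cap(proposed_chunk_size, total_items):
    """Boundary 2: projected branch count must stay below quota minus a safety buffer."""
    quota = int(ssm.get_parameter(Name="/apix/lambda-concurrency-quota")["Parameter"]["Value"])
    buffer = int(ssm.get_parameter(Name="/apix/concurrency-safety-buffer")["Parameter"]["Value"])
    projected_branches = -(-total_items // proposed_chunk_size)  # ceiling division
    return projected_branches <= quota - buffer


def should_roll_back(p95_latency_windows, error_windows):
    """Boundary 3: revert if p95 latency AND errors rose for two consecutive windows."""
    def rising(series):
        tail = series[-(BAD_WINDOWS + 1):]
        return len(tail) == BAD_WINDOWS + 1 and all(b > a for a, b in zip(tail, tail[1:]))
    return rising(p95_latency_windows) and rising(error_windows)


def render_definition(max_items_per_batch):
    """Hypothetical helper: returns an ASL definition with the Map state's batch size set.
    Heavily simplified; a real distributed Map would also need ProcessorConfig and an item reader."""
    return json.dumps({
        "StartAt": "FanOut",
        "States": {
            "FanOut": {
                "Type": "Map",
                "ItemBatcher": {"MaxItemsPerBatch": max_items_per_batch},
                "ItemProcessor": {"StartAt": "Work",
                                  "States": {"Work": {"Type": "Pass", "End": True}}},
                "End": True,
            }
        },
    })


def handler(event, context):
    """EventBridge-delivered inference; assumed shape: chunk sizes, item count, metric windows."""
    current, proposed = event["current_chunk_size"], event["proposed_chunk_size"]

    if not within_change_window(current, proposed):
        return {"applied": False, "reason": "chunk change window exceeded"}
    if not within_concurrency_cap(proposed, event["total_items"]):
        return {"applied": False, "reason": "concurrency cap"}
    if should_roll_back(event["p95_latency_windows"], event["error_windows"]):
        proposed = event["previous_chunk_size"]  # revert to the prior chunk size

    # Apply the approved chunk size through the Step Functions update API.
    sfn.update_state_machine(
        stateMachineArn=event["state_machine_arn"],
        definition=render_definition(max_items_per_batch=proposed),
    )
    return {"applied": True, "chunk_size": proposed}
```

Keeping each boundary as a small pure function makes the guardrails straightforward to unit test before they are allowed to gate production changes.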
Five lessons for anyone who wants to replicate APIX. First, sample generously: capture every span during peak windows so retry storms are visible, and downsample later to trim storage if needed. Second, activate guardrails, because the 25% cap on chunk changes smooths workload bursts and prevents oscillation. Third, connect costs: publishing dollar deltas alongside trace identifiers quickly gained trust from finance stakeholders. Fourth, inspect retry patterns, since high retry counts often pinpoint hard partitions or downstream rate limits; pairing retry count with chunk identifiers made those hotspots obvious. Fifth, retrain weekly, because as new entity types enter production the graph neural network must adapt to maintain prediction accuracy. Collectively, these practices turned APIX from a prototype into a production service adopted by multiple teams.

Every study has limitations. Cold start variability persists: even with SnapStart, residual jitter can influence tail latency and may exaggerate gains. Synthetic bias is possible because synthetic workloads, while realistic, may overlook extreme traffic patterns found in the wild; as future work we will replay full production traces to close that gap. Model drift remains a risk: shifting data distributions can erode prediction quality, so automated drift alerts trigger retraining whenever drift crosses a preset threshold. Recognizing these limits helps others decide whether APIX fits their environment and guides our roadmap for continued validation.

In closing, Adaptive Parallelism Insights turns detailed telemetry into timely recommendations that update Step Functions workflows safely. Teams gain faster executions, lower compute bills, and a complete audit trail without changing business logic. As next steps, we intend to coordinate tuning across multiple workflows that share concurrency limits and to explore WebAssembly-based span feature extraction to reduce overhead even further. Thank you for your attention. Your questions and feedback through the conference forums are welcome, and I look forward to hearing how you apply adaptive observability in your own pipelines. Thank you.
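As a small illustration of lesson four above (pairing retry counts with chunk identifiers), the sketch below sums retries per chunk from exported log lines; it assumes records shaped like the earlier instrumentation sketch rather than the actual APIX schema.

```python
# Illustration of lesson four: surface hot partitions by summing retries per chunk.
# Assumes EMF log lines shaped like the earlier instrumentation sketch (field names assumed).
import json
from collections import Counter


def retry_hotspots(log_lines, top_n=10):
    """Sum RetryCount per ChunkId across exported log lines and return the worst offenders."""
    retries_by_chunk = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-EMF log lines
        if "ChunkId" in record and "RetryCount" in record:
            retries_by_chunk[record["ChunkId"]] += record["RetryCount"]
    return retries_by_chunk.most_common(top_n)


if __name__ == "__main__":
    # Example usage: point this at a JSON-lines export of the branch logs.
    with open("branch_logs.jsonl") as f:
        for chunk_id, retries in retry_hotspots(f):
            print(f"{chunk_id}: {retries} retries")
```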
...

Vamsi Praveen Karanam

Software Development Engineer @ Amazon



