Abstract
When a Step Functions fan-out unleashes thousands of Lambdas, cost spikes, latency cliffs, and tracing gaps follow. By streaming telemetry into a SageMaker-hosted graph neural network, you can pinpoint hot branches in real time and let AI auto-tune chunk sizes to keep SLOs and budgets safe.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Greetings, everyone.
Welcome to Conf42 Observability 2025.
My name is Ravin Kana, a senior software engineer who focuses on
cloud native automation and large scale data processing pipelines.
Our topic is Adaptive Parallelism Insights, which I'll abbreviate as APIX after this introduction.
APIX combines OpenTelemetry with a compact graph neural network so that an AWS Step Functions workflow can learn to tune itself while it's running.
By the end of this session, you will understand three things.
First, how we capture branch level data without rewriting business code.
Second, how a small model converts that data into useful guidance.
And third, how a closed loop safely applies the guidance to reduce latency and cost while keeping every decision auditable.
To begin, serverless fan-outs offer near-instant elasticity, but three problems surface as scale grows.
First, latency cliffs arise because a few slow branches delay the entire workflow.
Even a single hotspot can push the entire workflow past its service level agreement.
Second, tracing gaps appear because high-level dashboards average away branch-specific anomalies, which means engineers cannot pinpoint trouble quickly.
Third, cost spikes emerge when hidden retries and cold starts inflate the monthly bill without obvious warning.
These pains led us to frame three guiding questions.
Question one asks how to collect fine-grained telemetry from every branch while adding minimal overhead.
Question two considers whether a lightweight artificial intelligence
model can digest that stream quickly enough to make timely recommendations.
Question three asks how to apply those recommendations during a live run without breaking concurrency limits or financial guardrails.
The rest of the talk answers each question in turn.
Many enterprise workloads, such as data enrichment, ETL retries, and compliance redaction, depend on the Map state pattern inside AWS Step Functions to fan out tasks.
Limitations acknowledged: the native execution history application programming interface shows only coarse events, which hides the branch behavior that matters.
Standardization helps because OpenTelemetry, the leading vendor-neutral format for traces, metrics, and logs, lets us instrument once and analyze anywhere.
In parallel, CloudWatch Embedded Metric Format (EMF) adds structured counters inside Lambda logs without agents.
Research to date often targets virtual machine clusters with reinforcement learning; less focus has landed on dynamic state machine graphs, where topology changes every minute.
Opportunity identified: graph neural networks accelerate reasoning over graphs whose nodes and edges evolve rapidly, so they fit our fan-out scenario.
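To make that concrete, the sketch below shows the general shape of a two-layer graph neural network over a fan-out graph, written in PyTorch. It is a minimal illustration rather than the production model: the class name, feature width, hidden size, and pooling head are assumptions, and the real network described later in the talk is considerably larger.

```python
# Minimal two-layer graph neural network sketch (illustrative, not the
# production APIX model). Node features are assumed to be per-branch metrics;
# the adjacency matrix encodes the current workflow topology.
import torch
import torch.nn as nn

class BranchGNN(nn.Module):
    def __init__(self, in_features: int = 16, hidden: int = 128):
        super().__init__()
        self.layer1 = nn.Linear(in_features, hidden)
        self.layer2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, 1)  # one chunk-size score per workflow

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Each layer mixes a node's features with its neighbours' (message
        # passing), then applies a learned transform and a non-linearity.
        h = torch.relu(self.layer1(adj @ x))
        h = torch.relu(self.layer2(adj @ h))
        # Pool over all branch nodes to produce a single recommendation.
        return self.head(h.mean(dim=0))

# Example shapes: 500 branch nodes, 16 metrics each; an identity matrix stands
# in for the real normalized adjacency of the state machine graph.
model = BranchGNN()
print(model(torch.randn(500, 16), torch.eye(500)))
```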
APIX therefore unites detailed telemetry, open standards, and a compact graph neural network into one practical closed-loop control system.
The picture at right shows the flow of data and control.
Starting at the top, the instrumentation layer adds an OpenTelemetry span to every Task, Parallel, and Map branch.
Spans carry a chunk ID, and the same wrapper logs counters such as payload bytes, retry count, and cold start time by using CloudWatch Embedded Metric Format.
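A minimal sketch of what such a wrapper could look like follows, assuming the OpenTelemetry SDK and an exporter are already configured; the span, attribute, and metric names, the namespace, and the process function are illustrative placeholders rather than the exact APIX code.

```python
# Sketch of a branch-level Lambda wrapper: one OpenTelemetry span per chunk
# plus CloudWatch Embedded Metric Format counters printed to the log stream.
# Names below (span, attributes, namespace) are illustrative assumptions.
import json
import time
from opentelemetry import trace  # assumes the SDK and an OTLP exporter are configured

tracer = trace.get_tracer("apix.instrumentation")

def process(items):
    # Stand-in for the existing business logic, which stays unchanged.
    return {"processed": len(items)}

def emit_emf(chunk_id, payload_bytes, retry_count, cold_start_ms):
    # EMF is structured JSON in the log; CloudWatch extracts metrics without agents.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "APIX/Branches",
                "Dimensions": [["ChunkId"]],
                "Metrics": [
                    {"Name": "PayloadBytes", "Unit": "Bytes"},
                    {"Name": "RetryCount", "Unit": "Count"},
                    {"Name": "ColdStartMs", "Unit": "Milliseconds"},
                ],
            }],
        },
        "ChunkId": chunk_id,
        "PayloadBytes": payload_bytes,
        "RetryCount": retry_count,
        "ColdStartMs": cold_start_ms,
    }))

def handler(event, context):
    chunk_id = event.get("chunkId", "unknown")
    with tracer.start_as_current_span("map.branch") as span:
        span.set_attribute("apix.chunk_id", chunk_id)
        span.set_attribute("apix.retry_count", event.get("retryCount", 0))
        result = process(event.get("items", []))
        # Payload size here is a rough proxy; real cold-start timing would come
        # from measuring module initialization.
        emit_emf(chunk_id, len(json.dumps(event)), event.get("retryCount", 0), 0.0)
        return result
```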
Next, the telemetry pipeline sends spans over the OpenTelemetry Protocol to a collector running on AWS Fargate.
The collector passes real-time streams into Kinesis Data Firehose, which feeds live dashboards in Amazon Managed Grafana, and it also stores raw records as Parquet files in Amazon Simple Storage Service for deep analysis.
Moving downward, SageMaker Serverless Inference hosts a two-layer graph neural network with about 1.2 million parameters.
Every 60 seconds, the model consumes the latest window of branch metrics and predicts an ideal chunk size plus a confidence score and the projected cost and latency impact.
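A hedged sketch of that inference call is shown below; the endpoint name and the request and response fields are assumptions made for illustration.

```python
# Sketch of the periodic inference call to a SageMaker serverless endpoint.
# Endpoint name and payload/response schema are illustrative assumptions.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def request_recommendation(branch_window):
    """branch_window: recent per-branch metric records (assumed list of dicts)."""
    response = runtime.invoke_endpoint(
        EndpointName="apix-gnn-serverless",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"branches": branch_window}),
    )
    # Assumed reply shape:
    # {"chunk_size": 800, "confidence": 0.91, "cost_delta_usd": -3.2, "p95_delta_ms": -410}
    return json.loads(response["Body"].read())
```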
EventBridge then delivers that inference to a Lambda callback.
The callback applies guardrail checks and, if they pass, patches the running Map state through the Step Functions update application programming interface.
Finally, governance is preserved because each inference is hashed, stored in Systems Manager Parameter Store, and recorded in subsequent spans, giving auditors a full lineage from decision to outcome.
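A small sketch of that audit step might look like the following; the parameter path and span attribute name are hypothetical.

```python
# Sketch of the governance step: hash the inference, store it in Systems
# Manager Parameter Store, and stamp the hash onto the current span.
# The parameter path and attribute name are hypothetical.
import hashlib
import json
import boto3
from opentelemetry import trace

ssm = boto3.client("ssm")

def record_decision(inference, execution_arn):
    digest = hashlib.sha256(json.dumps(inference, sort_keys=True).encode()).hexdigest()
    ssm.put_parameter(
        Name=f"/apix/decisions/{digest[:16]}",   # hypothetical parameter path
        Value=json.dumps({"execution": execution_arn, **inference}),
        Type="String",
        Overwrite=True,
    )
    trace.get_current_span().set_attribute("apix.decision_hash", digest)
    return digest
```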
Automation must be safe, so the controller enforces three boundaries before any change.
First, the chunk change window limits any adjustment to 25% within five minutes, which prevents oscillation.
Second, the concurrency cap ensures the new chunk size keeps total Lambda concurrency below the account quota minus a safety buffer retrieved from Parameter Store.
Third, the rollback trigger monitors 95th percentile latency and total errors; if both rise for two consecutive windows, the controller reverts to the prior chunk size.
Together these safeguards let the model explore improvements while guaranteeing that production stability and cost limits remain intact.
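As a rough sketch of how those three boundaries could be enforced before a patch, consider the following. The Map state name, the ItemBatcher field path, the Parameter Store key, and the call-site wiring are assumptions; only the 25% cap and the two-consecutive-window rollback rule come from the description above.

```python
# Sketch of the three guardrails ahead of a chunk-size patch. State name,
# field path, and parameter key are hypothetical.
import json
import boto3

sfn = boto3.client("stepfunctions")
ssm = boto3.client("ssm")

MAX_CHANGE_RATIO = 0.25  # 25% cap per adjustment window

def within_change_window(current_size, proposed_size):
    return abs(proposed_size - current_size) <= MAX_CHANGE_RATIO * current_size

def within_concurrency_cap(proposed_size, item_count, account_quota):
    buffer = int(ssm.get_parameter(Name="/apix/concurrency-buffer")["Parameter"]["Value"])
    estimated_branches = -(-item_count // proposed_size)  # ceiling division
    return estimated_branches <= account_quota - buffer

def should_rollback(p95_history, error_history):
    # Both p95 latency and total errors rising for two consecutive windows.
    return (len(p95_history) >= 3 and len(error_history) >= 3 and
            p95_history[-1] > p95_history[-2] > p95_history[-3] and
            error_history[-1] > error_history[-2] > error_history[-3])

def apply_chunk_size(state_machine_arn, definition, proposed_size):
    # Assumes the Map state's batch size lives at this hypothetical path.
    definition["States"]["FanOut"]["ItemBatcher"]["MaxItemsPerBatch"] = proposed_size
    sfn.update_state_machine(
        stateMachineArn=state_machine_arn,
        definition=json.dumps(definition),
    )
```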
We measure first because insight depends on data quality.
Span emission adds about two milliseconds per invocation, negligible overhead for data processing tasks that often run for seconds.
After capture, the OpenTelemetry Collector batches spans, pushes them to Kinesis Data Firehose, and writes full-fidelity records to Amazon S3.
Visualization follows, where Managed Grafana converts those streams into live heat maps highlighting slow or throttled branches within seconds.
Flexibility matters, and because every interface follows open standards, teams can replace Grafana with another dashboard or route the Parquet files in S3 to any analytical platform without revisiting application code.
Experience shows that this separation accelerates debugging and gives finance teams precise cost attribution tied to trace identifiers.
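For instance, a few lines of pandas over the raw Parquet records can tie spend and latency back to individual traces; the bucket path and column names here are assumptions, and reading s3:// paths directly requires the s3fs package.

```python
# Sketch of offline cost attribution over the raw Parquet span records in S3.
# Bucket path and column names are illustrative assumptions.
import pandas as pd

spans = pd.read_parquet("s3://apix-telemetry/spans/date=2025-06-01/")  # hypothetical path

cost_by_trace = (
    spans.groupby("trace_id")
         .agg(total_cost_usd=("estimated_cost_usd", "sum"),
              p95_latency_ms=("duration_ms", lambda s: s.quantile(0.95)),
              retries=("retry_count", "sum"))
         .sort_values("total_cost_usd", ascending=False)
)
print(cost_by_trace.head(10))
```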
Evaluation used three workloads.
Synthetic uniform contains 10 terabytes of uniformly sized one-megabyte records.
Synthetic skewed also spans 10 terabytes but injects 5% larger records, up to 10 megabytes, to force slower processing.
Live enterprise entity data, for example contractual data, represents three terabytes of anonymized production updates.
Baselines compared were static batching at 1,000 items, dynamic subtree partitioning that splits branches at about two gigabytes, and APIX adaptive chunking.
Metrics tracked include 50th, 90th, and 95th percentile latency, job cost in United States dollars, and throttle error counts.
Reading the table, adaptive chunking shows the lowest latency across all percentiles and the lowest cost for every workload.
For synthetic skewed, throttle errors drop from 47 under static batching to five under APIX, demonstrating resilience against uneven record sizes.
Live enterprise data follows the same trend, with adaptive chunking cutting both tail latency and cost while reducing throttles from 24 to three.
In summary, real time tuning delivers consistent performance
gains and tangible cost savings.
Five lessons follow for anyone who wants to replicate APIX.
First, sample generously, capturing every span during peak windows so retry storms are visible; downsample later to trim storage if needed.
Second, activate guardrails, because the 25% cap on chunk changes smooths workload bursts and prevents oscillation.
Third, connect costs: publishing dollar deltas alongside trace identifiers quickly gained trust from finance stakeholders.
Fourth, inspect retry patterns, since high retry counts often pinpoint hot partitions or downstream rate limits; pairing retry count with chunk identifiers made those hotspots obvious.
Fifth, retrain weekly, because as new entity types enter production, the graph neural network must adapt to maintain prediction accuracy.
Collectively, these practices turn APIX from a prototype into a production service adopted by multiple teams.
Every study has limitations. Cold start variability persists: even with SnapStart, residual jitter can influence tail latency and may exaggerate gains.
Synthetic bias is possible because synthetic workloads, while realistic, may overlook extreme traffic patterns found in the wild; future work will replay full production traces to close that gap.
Model drift remains a risk: shifting data distributions can erode prediction quality, so automated drift alerts trigger retraining whenever measured drift crosses a preset threshold.
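A minimal sketch of such a drift alert, assuming a simple mean-error metric, a made-up threshold, and a placeholder retraining hook:

```python
# Sketch of a drift alert. Metric, threshold, and retraining hook are
# illustrative assumptions, not the production implementation.
def drift_detected(recent_abs_errors, threshold=0.15):
    return sum(recent_abs_errors) / len(recent_abs_errors) > threshold

def maybe_retrain(recent_abs_errors):
    if drift_detected(recent_abs_errors):
        start_retraining()  # placeholder for the team's retraining pipeline

def start_retraining():
    print("drift threshold crossed; launching retraining ahead of the weekly schedule")
```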
Recognizing these limits helps others decide whether APIX fits their environment and guides our roadmap for continued validation.
In closing, Adaptive Parallelism Insights turns detailed telemetry into timely recommendations that update Step Functions workflows safely.
Benefits delivered: teams gain faster executions, lower compute bills, and a complete audit trail without changing business logic.
Next steps: we intend to coordinate tuning across multiple workflows that share concurrency limits and explore WebAssembly-based span feature extraction to reduce overhead even further.
Appreciation.
Thank you for your attention.
Your questions and feedback through conference forums are welcome and I look
forward to hearing how you apply adaptive observability in your own pipelines.
Thank you.