Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning everyone.
Thank you for joining me today.
My name is Aroma and I'm here to talk about something that keeps many of us in FinTech awake at night: building financial systems that actually scale.
Today we are diving deep into engineering Athena, a journey of building a scalable, resilient, and compliant financial trading platform.
Now I know what you're thinking.
Another scaling talk.
But here's the thing.
Financial systems are different beasts entirely.
They're not your typical e-commerce platform where you can eventually-consistent your way out of problems. In finance, if your system goes down during market hours, people lose money.
Real money.
If your calculations are off, even by a few basis points,
regulators come knocking.
And if you cannot explain exactly what happened to every trade
three years ago, you might be looking at serious legal trouble.
So today I'm going to share the story of how we transformed Athena from
a system that could barely handle thousands of trades to one processing
millions daily across global markets.
We'll cover the technical challenges, the architectural solutions, and
perhaps most importantly, the lessons we learned the hard way.
By the end of this talk, you'll understand why financial systems
require fundamentally different approaches to scaling, and you'll
have concrete strategies you can apply to your own systems.
Let's dive in.
First, let me set the stage.
Athena is a financial trading platform that handles the entire lifecycle of trading operations.
We are talking about trade execution, risk management, profit and loss
calculations, real time market data processing, and massive end
of the day batch reconciliations.
To give you a sense of scale, we process millions of trades
daily across global markets.
Each trade might be worth hundreds of millions of dollars, and every
single one of those trades needs to be tracked, valued, risk assessed,
and reported to multiple regulatory bodies, often in real time.
Now before we jump into the technical architecture, I need to make sure we
are all speaking the same language.
Finance has its own vocabulary and understanding these core entities
is crucial to understanding our scaling challenges.
Let me walk you through a simple example.
Imagine Trader One wants to buy a hundred units of a corporate bond from Trader Two. This happens on August 5th, 2025 at exactly 3:00 PM GMT.
Both traders work for firms governed by different legal entities,
and the trade will be cleared through a central clearinghouse.
This single transaction creates what we call a trade: the atomic unit of what actually happened in the market.
But here's where it gets interesting.
This one trade creates multiple deals. Trader One's book shows that they now own a hundred units of this bond; Trader Two's book shows that they received cash for selling those hundred units. These are mirror images of the same transaction, but they live in different books and serve different purposes.
And that brings us to books. Think of them as accounting ledgers.
A book is a collection of credits and debits, typically grouped
by trader desk and product type.
So we might have a corporate bonds book, a Treasuries book, a European derivatives book, and so on.
Now, why does this matter from the engineering perspective?
Because each of these entities has different scaling characteristics.
Trades are write-once, read-many. Deals are frequently updated as markets move.
Books need to be aggregated and reported in real time.
Understanding these patterns is crucial to building efficient systems.
So A trade is a single executed transaction of a financial instrument.
It represents the actual buying or selling event, capturing
what happened in the market.
The key attributes in Athena are: the instrument, for example a bond or a CDS, which is a derivative type of instrument; the quantity or notional, which corresponds to the size of the trade; the price or premium, which is the executed price; the counterparties, which are the buyer and seller; the timestamp and trade date; and the status, whether it's pending, settled, corrected, defaulted, et cetera.
Typically the lifecycle in Athena goes: created, when the transaction is executed; updated, rarely for trades but frequently for deals because of how markets move, to correct errors or mark status changes; and recorded, always existing in the book, forming the basis for positions, risk, and P&L.
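To make that concrete, here's a minimal sketch of what such a trade entity might look like as a Python dataclass. The field names and status values mirror the attributes just described, but the schema itself is an illustrative assumption, not Athena's actual model.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class TradeStatus(Enum):          # illustrative statuses from the talk
    PENDING = "pending"
    SETTLED = "settled"
    CORRECTED = "corrected"
    DEFAULTED = "defaulted"


@dataclass
class Trade:
    """A single executed transaction of a financial instrument."""
    trade_id: str
    instrument: str               # e.g. a bond or CDS identifier
    quantity: float               # notional / size of the trade
    price: float                  # executed price or premium
    buyer: str                    # counterparties
    seller: str
    trade_date: datetime
    status: TradeStatus = TradeStatus.PENDING
```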
Let me show you how these entities flow through a typical trading ecosystem.
Now, this isn't just theory; this is the actual architecture that every major investment bank uses, with variations.
First, we have the front office.
This is where trades are born.
Traders, salespeople, or electronic trading platforms execute transactions.
These get captured in systems like Athena.
Then we move to the middle office.
This is where the magic happens.
Every trade gets validated for legal compliance, credit risk, and regulatory requirements.
Trades get assigned to the appropriate books.
Price data gets enriched.
Risk sensitivities get calculated.
Only clean risk approved trades move forward.
Next is the back office.
This handles settlement.
We are generating settlement instructions, processing payment
flows, reconciling with counterparties, and handling corporate actions like
bond maturities or credit events.
Finally, we have books and P&L, and this is where all those individual trades get aggregated into meaningful business information: daily mark-to-market valuations, realized and unrealized gains and losses, and risk measures like value at risk.
Now here's the critical insight.
Each of these stages has different performance requirements, different consistency requirements, and different failure modes.
A traditional monolithic approach tries to solve all of this with
the same architecture, and that's where things start to break down.
Let me give you some concrete numbers.
In our front office, we might see trade bursts of around 10,000 transactions per second during volatile market conditions. Our middle office needs to validate and enrich each of these trades within 30 seconds to meet regulatory requirements.
Our back office processes settlement instructions that might
not execute for days or weeks.
Our books and p and l systems need to provide realtime position data
to traders while simultaneously running complex risk calculations.
Each of these has fundamentally different patterns: burst writes, real-time validation, long-running workflows, and mixed read-write workloads. Trying to handle all of this with a single database, a single application server, and synchronous processing is a recipe for disaster.
And that brings us to our scaling challenge.
Let me paint you a picture of where we started. Picture the classic monolithic nightmare, but with financial regulatory requirements layered on top. We had single everything: one ingestion instance, one application node, one primary database with no replicas and no sharding. The only way to scale was to throw bigger and bigger hardware at the problem, classic vertical scaling.
Every single call was synchronous and blocking.
If a trader wanted to see their p and l, that request would hit the same database
that was trying to process incoming market data feeds and run compliance reports.
Guess what would happen during market volatility?
The system would grind to a halt. But here's what made it worse: we were running everything as batch jobs.
Instead of processing data as it arrived, we'd accumulate changes all
day and then run massive recalculations.
These jobs would take eight to 12 hours to complete. If anything failed at 3:00 AM, our traders would start their day with stale data. And we had no backpressure handling. When market data feeds would spike, which happens every single day and during major economic announcements such as a government change or a war, our ingestion threads would stall.
This would cascade to user facing operations.
Traders couldn't see their positions because the system was choking on market data feeds.
We were a single-region deployment, which meant high latency for our London and Hong Kong traders, and any regional outage was a complete platform outage. But perhaps worst of all, our compliance and audit data was just extra rows in the same operational database. This created noisy-neighbor problems and made it incredibly easy to accidentally break compliance invariants.
So this is an example of what the monolithic approach looks like. At the top are the feeds: user and legacy trading feeds, which is when you use an electronic platform to execute a trade, or a CSV upload of trades that the trader has already executed on their own platform. These would go through an HTTP polling infrastructure in between, which would feed everything into the ingestion service, which in the traditional monolithic approach would be a single instance with some level of synchronous parsing. It would then go into a single relational database in one region. It wouldn't have any replicas, and it would mix OLTP and OLAP. This would then eventually feed a risk computation unit, a reporting dashboard, and a compliance logger, and all of these would run nightly in batches via a cron job.
Sounds simple.
This is the appropriate architecture if you have around a thousand trades.
But what happens when you try to scale it?
So, the core traits of why this won't scale. It's basically single everything: one ingestion instance, one app node, one primary DB, no replicas, no sharding; only vertical scaling is possible. It's synchronous coupling: every call is blocking, and compute and reporting sit directly on the OLTP tables. Mixed workloads on one DB: ingestion writes, risk analytics, dashboards, and compliance all hammer the same tables and the same indices. Batch-ology: heavy recalculation runs nightly via a cron job instead of streaming or incremental computation. No backpressure: every spike in feed volume stalls ingestion threads and cascades to users. Single region: higher latency for global users, and any regional outage equals a platform outage. Add-on compliance: audit is just a few extra rows in the same DB. Schema rigidity: no event versioning, and schema changes require coordinated downtime. Adventure time: we are going to find out what happens when we scale this architecture.
This is a graphical representation of what the ingestion service feels like when there are millions of trades coming in every day. It probably feels like this, and this is probably what the database looks like. Every box in this particular database is actually the same deal. It differs from the others by a few parameters, and on certain days by just one parameter.
Here's a simple example that illustrates the core problem.
Let's say we have deal one that evolves over time.
On day one, we are holding instrument one worth a hundred dollars.
On day two, the market moves, so we are now holding $300 worth; day three, $500; and so on.
The naive approach is to store a complete record for each day, but think about what that means at scale. We have millions of deals, each potentially changing daily, over years of trading history. That's terabytes of mostly redundant data. But here's the real kicker: P&L needs to be calculated not just daily, but sometimes every minute during volatile markets.
Our traders need to see real-time profit and loss as positions change. So now we need to index all of this data by date, by instrument, by book, and by trader, and we need these queries to be fast, even as our data grows exponentially.
And that's just the beginning of our data challenges.
The redundancy of previous records can be addressed by relational databases keyed on dates, but a normalized table of changes cannot be guaranteed to function well because of which factors in the deals change: the regime can change, legal entities can change, the instrument can change, and pricing can change.
So how do you maintain data that changes frequently?
In financial markets, data points are usually dependent upon each other. For instance, if there is a credit event, it leads to the revaluation of an instrument, which leads to a change in the deal state, which eventually leads to a change in P&L, and the traders only care about the change in P&L. On the other hand, because of the magic of the markets, there could be a change in the value of the instrument that cascades directly into a change in P&L. What this means is that all of these calculations are basically dependent on each other and on each other's state.
Let me give you a concrete example.
Here's Deal One over four days. On day one, we are holding Instrument One worth a hundred dollars. On day two, the same instrument is now worth $150 due to market movement. The market crashed on day three, and we are now holding $50 worth of that instrument. On the fourth day, there was a corporate action, and Instrument One gets converted into Instrument Twelve, which is a cashflow instrument worth $30.
Now imagine you're a regulator asking: show me the complete audit trail of Deal One. You need to reconstruct not just the value changes, but the instrument evolution, the corporate actions, the legal entity changes, everything.
Traditional relational databases start to buckle under this kind of complexity.
Normalized tables can't handle the constant schema changes. Full snapshots create massive storage overheads. Point-in-time queries require increasingly complex joins as the data grows.
And here's what really keeps database administrators up at night: lock contention. During peak trading hours, especially around market open and close, we see massive concurrent updates to the same records. Thousands of traders are updating their positions simultaneously. Market data feeds are updating instrument prices every millisecond. Risk systems are recalculating exposures in real time.
All of this happening on the same set of database tables with the same indexes.
The result: query performance that degrades over time, backup and restore operations that take longer than the business day, and a system that becomes increasingly fragile as it grows.
So that was our reality.
A system that worked fine for thousands of trades but
completely fell apart at millions.
A system where adding new features meant risking the
stability of existing operations.
A system where brilliant traders were spending more time waiting for screens to refresh than actually trading.
Something had to change.
And that brings us to our solution.
Our first breakthrough came from a simple observation.
Most of the time when data changes, only a small subset of dependent
calculations actually need to be updated.
So here we are marking our functions with dependency decorators.
When we calculate position value, we are dependent on
instrument ID and market data.
When we calculate base currency value, we're dependent on
position value and FX rates.
The magic happens when something changes.
If the price of instrument one moves, our dependency graph automatically
identifies every calculation that depends on instrument one
and triggers only those updates.
Before this change, a single price movement would trigger a
full recalculation of potentially thousands of positions.
Now we only recalculate what actually changed.
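Here's a minimal sketch of the idea in Python. The `depends_on` decorator, the registry, and the transitive `on_change` propagation are my own illustration of dependency-driven recalculation, not Athena's actual framework.

```python
from collections import defaultdict

# Maps a data key (e.g. "instrument_1.price") to the calculations that read it.
DEPENDENCY_GRAPH = defaultdict(list)


def depends_on(*keys):
    """Register a calculation against the inputs it depends on."""
    def decorator(func):
        for key in keys:
            DEPENDENCY_GRAPH[key].append(func)
        return func
    return decorator


@depends_on("instrument_1.price", "market_data")
def position_value(state):
    return state["quantity"] * state["instrument_1.price"]


@depends_on("position_value", "fx_rates.USD_GBP")
def base_currency_value(state):
    return position_value(state) * state["fx_rates.USD_GBP"]


def on_change(key, state, results=None):
    """Re-run only the calculations that (transitively) depend on a changed key."""
    results = {} if results is None else results
    for func in DEPENDENCY_GRAPH[key]:
        results[func.__name__] = func(state)
        on_change(func.__name__, state, results)   # propagate downstream
    return results


state = {"quantity": 100, "instrument_1.price": 101.5, "fx_rates.USD_GBP": 0.79}
# A move in instrument_1.price recomputes position_value, then
# base_currency_value, and nothing else.
print(on_change("instrument_1.price", state))
```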
This single architectural change reduced our end-of-day processing time from eight hours to five hours. But more importantly, it made real-time updates feasible. Instead of minute-long delays for our P&L updates, we were getting sub-second responsiveness.
Our second major breakthrough was rethinking how we store data that changes frequently.
This is what a deal looks like, evolving over days.
In the traditional approach, we store complete records, everything, even when 49 out of the 50 fields haven't changed. It's massively wasteful.
In our delta approach, we store the full record once, as the first state, and then we only store what actually changes. Obviously, you have to maintain the head state for the day you're doing the processing. Day two might just be a price change. Day three might just be a quantity and status change.
We can reconstruct any point-in-time view by applying deltas sequentially.
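A minimal sketch of the delta idea, assuming a deal state is just a flat dictionary; Athena's real storage format is certainly richer than this.

```python
def make_delta(previous: dict, current: dict) -> dict:
    """Keep only the fields that actually changed since the previous state."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


def state_as_of(initial: dict, deltas: list, n: int) -> dict:
    """Reconstruct a point-in-time view by applying the first n deltas in order."""
    state = dict(initial)
    for delta in deltas[:n]:
        state.update(delta)
    return state


# Day 1 is the full record; each later day stores only what changed.
day1 = {"deal_id": "deal_1", "instrument": "instr_1", "quantity": 100, "price": 100.0}
deltas = [
    {"price": 150.0},                         # day 2: price change only
    {"quantity": 80, "status": "corrected"},  # day 3: quantity and status change
]
print(state_as_of(day1, deltas, 2))           # head state after day 3's delta
```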
Now let's talk about geographic distribution. In finance, this isn't just about performance, it's about regulatory requirements and disaster recovery.
Our primary trading hubs are in New York and London, close to major exchanges.
Each location has a primary database instance with three local replicas for I/O optimization. But here's the tricky part: we need synchronous replication within each region for consistency, but asynchronous replication between regions to handle network partitions.
Why?
When markets are moving fast, traders can't wait for cross-Atlantic network latency.
They need local data to be consistent and fast, but we also need to ensure
that if London goes dark, New York has all the data it needs to keep operating.
This is a classic CAP theorem decision. In finance, we choose consistency and partition tolerance, accepting higher latency for global users during network issues.
This diagram basically shows you potential locations of book records all across the US, so you can consider each of these dots as a book. These are replicated across US geography because data loss events are geographically correlated: power outages, natural disasters, et cetera, are usually limited by geography.
And this is a representation of the Hydra DB. As you can see, each of the deals is replicated multiple times across different database instances.
This is possibly what it looks like across the globe. Usually, when trading, most deals are limited to certain regions because of legal entities. So you can see that data is replicated across the US, across Europe, and across Australia, and then there is a specific DB that maintains all of the trades that are cross-regional or cross legal entity.
What this helps with is computing at scale. It helps us parallelize as much as possible because, by the nature of markets, trading is usually regional and legal entity based. Athena maintains a different instance for cross legal entity trades, often using clever techniques to add a third leg to the trade to be able to compute efficiently, making for some very clever corner cases.
For instance, if there's a trade that is cross legal entity, Athena will create a trade associated with the legal entity in North America and one associated with the legal entity in Europe: a deal that is associated with the legal entity in North America, a deal that is associated with the legal entity in Europe, and a third deal that is associated with the cross legal entity link between North America and Europe, which is stored in a completely separate instance. That way, the data that exists only in the North American markets and the data that exists only in the European markets can be computed in parallel, independently of each other.
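As a rough sketch of that booking rule, assuming hypothetical region labels and a `route_deals` helper that is not Athena's actual API, a cross-legal-entity trade fans out into two regional deals plus a third cross-entity leg stored separately:

```python
def route_deals(trade: dict) -> dict:
    """Fan a trade out into per-region deals plus a cross-entity leg.

    Returns a mapping of storage target -> deal record. The region names
    and the 'CROSS_ENTITY' store are illustrative assumptions.
    """
    buyer_region = trade["buyer_legal_entity_region"]    # e.g. "NA"
    seller_region = trade["seller_legal_entity_region"]  # e.g. "EU"

    deals = {
        buyer_region: {"trade_id": trade["id"], "side": "buy", "region": buyer_region},
        seller_region: {"trade_id": trade["id"], "side": "sell", "region": seller_region},
    }
    if buyer_region != seller_region:
        # Third leg: lives in a separate cross-entity instance so that the
        # regional books can still be computed independently in parallel.
        deals["CROSS_ENTITY"] = {
            "trade_id": trade["id"],
            "regions": (buyer_region, seller_region),
        }
    return deals
```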
Our end of day calculations used to be a single massive monolithic process.
If anything failed, we would have to start over from the beginning.
We completely redesigned this as a dependency-aware parallel pipeline. Look at this architecture.
We break down processing into independent units. Instrument A reference data processing runs on one compute node, for instance, Instrument B on another, and deal processing for different books runs in parallel.
But here's the key insight.
We maintain dependencies.
The P&L completion job cannot start unless all of the book processing jobs are complete. But if Book One processing fails and Book Two succeeds, we can restart jobs for Book One using the cached results from the reference data jobs.
This fault tolerance approach transformed our end-of-day processing. Instead of eight-hour, all-or-nothing runs, we have resilient pipelines that can recover from individual component failures.
The reason we have separated the instrument reference data processing from the deal state book processing entirely is that Instrument A could exist in a deal that exists in both Book One and Book Two, which means that the Instrument A reference data processing needs to be complete before Book One processing or Book Two processing can start.
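Here's a small sketch of such a dependency-aware, restartable pipeline using Python futures. The job names echo the talk, but the scheduler, the job functions, and the cache argument are illustrative assumptions rather than Athena's actual orchestration code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(jobs, cache=None):
    """Run jobs respecting dependencies; reuse cached results when restarting.

    jobs:  name -> (callable, [dependency names])
    cache: results from a previous partial run, so only missing jobs rerun.
    Returns (results, failures).
    """
    results = dict(cache or {})
    failures = {}
    pending = {name for name in jobs if name not in results}
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [n for n in pending
                     if all(d in results for d in jobs[n][1])
                     and not any(d in failures for d in jobs[n][1])]
            if not ready:          # everything left is blocked by a failure
                break
            futures = {n: pool.submit(jobs[n][0], *(results[d] for d in jobs[n][1]))
                       for n in ready}
            for name, fut in futures.items():
                pending.discard(name)
                try:
                    results[name] = fut.result()
                except Exception as exc:
                    failures[name] = exc
    return results, failures


# Illustrative jobs: reference data feeds both book runs; P&L needs both books.
jobs = {
    "instrument_a_refdata": (lambda: {"instr_a": 101.5}, []),
    "book_1": (lambda ref: sum(ref.values()), ["instrument_a_refdata"]),
    "book_2": (lambda ref: sum(ref.values()) * 2, ["instrument_a_refdata"]),
    "pnl_completion": (lambda b1, b2: b1 + b2, ["book_1", "book_2"]),
}
results, failures = run_pipeline(jobs)
# If book_1 had failed, rerunning run_pipeline(jobs, cache=results) would reuse
# the cached reference data and only retry book_1 and pnl_completion.
```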
Let me show you one of our most innovative solutions: compute building blocks, or CBBs.
The traditional approach pulls all the data to a central server
and does computation there.
This creates memory pressure, CPU contention, and I/O bottlenecks.
Our CBB approach creates independent compute nodes. Each loads data into its own memory space, executes computation independently, and returns results for aggregation. Think of it like AWS Lambda, but for financial calculations. Each CBB gets its own playing field to load data, run computations, and return results without affecting the main server.
This approach gives us horizontal scaling for compute intensive
operations while keeping the main system responsive for user interactions.
So, for instance, consider each of these as books all across the globe, and for this particular example, say we are running the book service processing for the end-of-day calculations in Europe. Usually a book runs on multiple CBB nodes, but in this instance, each of these books could be sent to a particular CBB as a reference, and the CBB then communicates with the Hydra DB to load the book details and all of the deals in that particular book, which are unrelated to all of the deals in all of the other books. They could be related, but for the purpose of these P&L calculations, which happen on a book-by-book basis, they can be considered independent, which means you get a huge amount of parallel processing and you basically save a lot of time.
This is horizontal sharding because we are keeping data independent of each other.
So, for every computation, in order to optimize for in-memory computation and more parallelism, every operation is chunked into independent computable chunks, which form a Lambda-like function that executes on its own CBB nodes, and the results are collated. For instance, instead of pulling the data for Instrument One to compute metrics on the server itself, another compute block is given a function and a Hydra call to load the data, so it loads it in memory without affecting the compute of the main server. Basically, each node gets its own playing field to load data and run its computations.
Before we move on, let me touch on something critical: compliance architecture. In finance, compliance isn't an afterthought. It's a core architectural requirement. We learned this the hard way.
Our solution was to separate operational data from compliance data physically.
Our OLTP systems handle real-time trading with performance optimization.
Our OLAP systems handle regulatory reporting with long-term
retention and query optimization.
This separation prevents compliance queries from impacting trading operations
while ensuring we can reconstruct any historical state for audit purposes.
We also implemented automated compliance feed processing.
Every trade, every price change, every position update gets
streamed to immutable audit logs.
Regulators can access this data without touching our operational systems.
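As a tiny sketch of that idea, an append-only audit feed could look something like this; the event fields and the JSON-lines file are assumptions for illustration, not the actual compliance pipeline.

```python
import json
import time


def audit(event_type: str, payload: dict, log_path: str = "audit.log") -> None:
    """Append an immutable audit record; nothing here updates or deletes."""
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    with open(log_path, "a") as fh:               # append-only by construction
        fh.write(json.dumps(record) + "\n")


# Every trade, price change, and position update gets streamed the same way.
audit("trade_created", {"trade_id": "T-1", "instrument": "instr_1", "qty": 100})
audit("price_update", {"instrument": "instr_1", "price": 101.25})
```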
So this is the last step, but it's possibly the most important one. In certain cases, it comes after the deal state Book One processing, or after any deal state book processing. A story I have about how important this is in the financial domain: there was a time when someone was running a Python machine-learning-based analytics script on a bunch of deals, which ended up choking a queue, and the deals that were supposed to be processed were left in the state of "this deal has not been processed", even though they had been processed by the deal service, just not by the analytics script, because of some bugs in the script. It choked the queue, and it basically led to us paying a hundred thousand dollars in fines to the regulators because we could not meet the regulatory timeframe.
So, back to earth. Now let's address the elephant in the room: why Python for financial systems? I get this question a lot. Python isn't fast. Python has the GIL. Python is dynamically typed. All true. But in finance, these apparent weaknesses become strengths.
Financial instruments are incredibly diverse.
Bonds, derivatives, swaps, credit default swaps, each with different attributes, different valuation models, and different risk characteristics.
Python's duck typing and dynamic typing let us handle this diversity without rigid schemas. We can represent a bond and a CDS with the same processing logic. Python's dictionary structures work beautifully with our indexing requirements, and when we need raw performance, we drop down to C++ compute building blocks.
The result is a system that's both flexible for rapid business logic changes and performant for compute-intensive operations.
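A small sketch of what that duck-typing flexibility buys us. The `value` methods and the valuation formulas are deliberately simplified assumptions; the point is just that one processing loop handles both a bond and a CDS.

```python
class Bond:
    def __init__(self, notional, price):
        self.notional, self.price = notional, price

    def value(self, market):
        # Simplified: mark to the quoted bond price.
        return self.notional * market.get("bond_price", self.price) / 100.0


class CreditDefaultSwap:
    def __init__(self, notional, spread_bps):
        self.notional, self.spread_bps = notional, spread_bps

    def value(self, market):
        # Simplified: value moves with the quoted credit spread.
        return self.notional * (market.get("cds_spread_bps", self.spread_bps)
                                - self.spread_bps) / 10_000


def book_value(positions, market):
    """Same processing logic for any instrument that quacks like `value()`."""
    return sum(p.value(market) for p in positions)


positions = [Bond(1_000_000, 99.5), CreditDefaultSwap(5_000_000, 120)]
print(book_value(positions, {"bond_price": 101.0, "cds_spread_bps": 135}))
```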
Let me wrap up with the key principles we learned scaling Athena. First, embrace financial domain complexity.
Don't try to force financial data into simple relational models.
The domain is inherently complex and your architecture
needs to reflect that reality.
Build flexibility into your data models from day one and plan
for frequent schema evolution.
Second, optimize for change, not just scale.
In most tech companies, data grows but doesn't fundamentally change. In finance, data mutates constantly and in unpredictable ways. Dependency-driven updates beat full recalculations, delta storage beats full snapshots, and event-driven architecture beats batch processing.
Third, separate concerns physically. Don't try to optimize operational data and analytical data with the same approach.
They have fundamentally different access patterns and requirements.
Fourth, design for failure.
In finance, failure isn't just an inconvenience, it's a regulatory event. As I mentioned before, build in independent processing units, cache intermediate results, and have graceful degradation strategies.
Let me share some of the pitfalls we encountered so you don't
have to repeat our mistakes.
First, premature optimization. We spent months optimizing the wrong bottlenecks because we didn't properly profile our system under realistic load. Start with clear business requirements, profile before optimizing, and measure the impact of every change. Second, ignoring data lineage.
Financial regulators don't just want your current data.
They want to understand how you got there.
Build audit trails and data lineage tracking into your architecture from the beginning; don't bolt on compliance as an afterthought.
Third pitfall: underestimating operational complexity. Multi-region deployments, monitoring distributed systems, handling partial failures.
The operational overhead is significant.
Plan for it and budget for it.
Looking ahead, the financial technology landscape is evolving rapidly.
Stream processing technologies like Kafka are becoming the
standard for real-time analytics.
Event sourcing is gaining adoption for complete audit trails.
On the regulatory side, we are seeing increasing demands
for real-time reporting.
Cloud adoption is accelerating, even in highly regulated industries. Data residency and sovereignty requirements are becoming more complex. Privacy regulations like GDPR are impacting how we handle financial data.
Building scalable financial systems is one of the most challenging problems in software engineering. You're dealing with regulatory complexity, massive scale, real-time requirements, and zero tolerance for errors, but it's also incredibly rewarding. When your system processes millions of trades flawlessly, when traders can react to market moves in milliseconds, and when regulators can access audit trails instantly, you're enabling the global financial system to function more efficiently.
The key is to respect the domain complexity while applying
sound engineering principles.
Don't try to solve everything at once. Start with your core delta model, build in flexibility and observability from the beginning, and scale incrementally.
For instance, this is what Athena basically looks like, and it is what any trade management platform would look like. There would be the trade feeds, there would be a trade ingestion layer, and there would be a trade and deal service, which would eventually roll up into a book service and which would have to go to compliance. The trade and deal service interacts heavily in this case with the distributed compute, which interacts with the database and also outputs P&L and risk computations. The CBBs are also associated with event pipelines, which emit alerts, notifications, and corporate actions, some of which eventually land in the queue of the trade and deal service, depending upon your use case. They also connect to monitoring dashboards and other regulatory reports. Sometimes the compliance services and the regulatory reports are connected, based on what sort of regulatory agency you're reporting to.
So this is what Athena's high-level scalable architecture looks like. We looked at the trade ingestion layer, we looked at queues, and we looked at the trade service, the deal service, and the book service, which is associated with the compliance service. We looked at ways to parallelize all of these services and to shard data, because most of the data is high scale but independent of each other. As a matter of fact, it is required for the risk and P&L calculations that each book be independently weighed against the compute metrics. All of these services interact with the distributed compute, again allowing for even more parallelization, which interacts with the Hydra database, which is geo-distributed. That basically means that a trade executed in a certain geo region would only be touching a particular geo-located service instance, a geo-located compute block instance, and a geo-located Hydra instance. And Hydra is replicated in order to relieve I/O bottlenecks, which eventually feeds regulatory reporting and other monitoring services.
If you compare these two models, this is the monolithic approach and this is the highly scaled approach. It looks almost the same, except that here, after the HTTP polling, this is not an asynchronous service, so anything that gets blocked in between becomes a blocker for any other request that the service receives at that time. On the other hand, because of our parallelization approaches, if a node fails, there are other nodes that can take over thanks to the replication of data. A lot of the data is independent of each other and geo-distributed, which means any failure in London would not affect anything in New York, and a failure on a compute node would not mean that the service is down for every sort of request, which is not true in the monolithic case.
Thank you so much.