Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, hi everyone.
I'm Shiv.
And I'm here to present a very interesting topic on forecasting,
which is platform-first forecasting.
Basically, we are going to talk about engineering scalable
machine learning systems that drive a lot of business value.
For the next 15 minutes, I want to discuss fundamentally how to
build really robust, scalable forecasting systems that add
consistent business value, right?
It goes beyond focusing specifically on accuracy and looks at the
architecture of the forecasting system as a whole.
A little bit about myself before we dive into the details: I'm Shiv.
I have almost 13 years of work experience across multiple industries
in planning, supply chain, S&OP, and forecasting.
I've worked with Caterpillar and Dell, and right now I'm with Amazon,
where I lead a team of demand planners, a forecasting team for
Amazon, trying to build something very similar to what I'm going to
talk about here: highly scalable, machine-learning-based forecasting
systems that are helping support Amazon's rapid growth.
At a high level, the agenda I'll be covering today has five key items.
First, we'll talk about what the platform engineering revolution
means, which is really upcoming right now.
Then the architectural foundations on which it is built.
Why it goes beyond accuracy as a metric when you define the success
criteria for a good forecast.
Then implementation strategies, which are really important: how we
put this whole architecture to use for the business.
And then some really cool real-world success stories where this is
already being used.
Awesome.
Rethinking machine-learning-based forecasting through a platform lens.
It's really important to understand that the machine learning
landscape has undergone a major shift.
It is not just about the science, this machine learning model.
It's about a lot of different elements that enable this model to be
successful: the tuning, the platform on which it is built, the
architecture that supports its deployment and usability, the
pipeline of data that feeds these models.
Hence, platform-first forecasting represents a very holistic
approach to machine learning system design, and it fundamentally
prioritizes three things, right?
Operational excellence: fundamentally, how this whole architecture
is put to use.
Scalability with a multi-year vision: building a system that does
not require complete re-engineering every one or two years, but
instead has a really good five-to-ten-year roadmap.
And then, of course, business alignment: how the business teams are
eventually going to use it and put it to work.
A key insight I would love to add based on my personal experience:
brilliant models sitting in research notebooks provide almost zero
business value, while moderately accurate models deployed well
through scalable systems add much more value than very sophisticated
models that end up living in research papers all their life, right?
So it's very important to learn and understand how we can deploy
these complex models for the business to use, and then derive a lot
of really good value from them.
The platform engineering revolution in machine learning has
happened, in my view, through three main steps.
First is abstraction without oversimplification.
Of course, abstraction helps a lot: really effective platforms hide
the complexity of the models and allow the user to interact with
them using really simple, easy parameters, right?
And I think that is a really important success criterion for this
whole architecture.
Second is progressive disclosure of complexity.
It's really important that users of these architectures and models
are not overwhelmed with information and complexity, because the
moment that happens, it really hampers deployment and adoption in
general.
So for anyone who is going to use it, we first have to assume they
have minimal technical expertise, and then create a really clear
glide path for onboarding them and exposing the complexity over time.
And lastly, API-first design.
It's really important that APIs evolve over time, so clear,
versioned APIs are needed for all platform interactions.
That enables loose coupling between the forecasting systems and
their consumers, and it helps long-term scalability as well.
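To make that concrete, here is a minimal sketch of what a versioned
forecast endpoint could look like, assuming FastAPI and pydantic;
the route, field names, and model stub are illustrative, not from
the talk.

```python
# Minimal sketch of a versioned forecast API. FastAPI is an
# assumption; all names here are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ForecastRequest(BaseModel):
    sku: str
    horizon_weeks: int = 4

class ForecastResponse(BaseModel):
    sku: str
    forecast: list[float]
    version: str

def run_model(sku: str, horizon: int) -> list[float]:
    # Hypothetical stand-in for the real model call.
    return [100.0] * horizon

@app.post("/v1/forecast", response_model=ForecastResponse)
def forecast_v1(req: ForecastRequest) -> ForecastResponse:
    # The /v1 contract stays frozen; breaking changes ship under a
    # new /v2 route, keeping consumers loosely coupled to the platform.
    return ForecastResponse(sku=req.sku,
                            forecast=run_model(req.sku, req.horizon_weeks),
                            version="1.3.0")
# Serve with, e.g.: uvicorn module_name:app
```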
Critical architectural components.
I think four key components enable this vision of platform-first
forecasting ML systems.
First is data, of course.
Data is the bread and butter of any machine learning model.
Being able to build really solid data pipelines, as well as clean,
usable data feeds, is fundamentally the backbone of it.
So the very first step in making this architecture successful is
having a solid data architecture.
Number two is stream processing.
Of course, technologies like Apache Kafka enable processing of the
data we feed into these models at a very high velocity, and being
able to have something that can absorb and understand this data is
super, super important.
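As a rough illustration of that ingestion path, here is a minimal
consumer loop using the confluent-kafka Python client; the broker
address, topic name, and message shape are assumptions for the sketch.

```python
# Sketch of streaming demand signals out of Kafka and into a feature
# pipeline. Topic, broker, and message format are hypothetical.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "forecast-feature-pipeline",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["demand-events"])  # hypothetical topic name

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        event = json.loads(msg.value())
        # Hand each event to the feature pipeline at ingest velocity
        # instead of waiting for a nightly batch load.
        print(event.get("sku"), event.get("units"))
finally:
    consumer.close()
```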
Security controls, again, are a foundational element, right?
Data privacy and intellectual property protection are the foundation
of any system that consumes and manages a ton of data.
Being able to build something like that is really important, and a
core building block for the architecture that we are discussing.
Lastly, feature stores.
Every machine learning system, I should say, is unique in its own
right.
Being able to understand what features are going to be built for
those specific use cases and those specific businesses is really
important, and hence understanding where specialization is needed is
very important for tracking and for feature-serving capabilities,
so we know what we need to continuously evolve and what we need to
maintain.
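As a toy sketch of the feature-store idea, here is one shared lookup
that serves identical feature values at training and serving time; a
production system would use a dedicated store rather than a dict, and
every name below is illustrative.

```python
# Toy in-memory feature store: one canonical lookup so training and
# serving read the same values. All names are illustrative.
class FeatureStore:
    def __init__(self) -> None:
        self._table: dict[tuple[str, str], float] = {}

    def put(self, entity_id: str, feature: str, value: float) -> None:
        self._table[(entity_id, feature)] = value

    def get(self, entity_id: str, features: list[str]) -> dict[str, float]:
        # A missing feature raises KeyError, surfacing gaps early
        # rather than silently feeding defaults into the model.
        return {f: self._table[(entity_id, f)] for f in features}

store = FeatureStore()
store.put("sku-123", "trailing_4wk_demand", 412.0)
store.put("sku-123", "promo_flag", 1.0)
print(store.get("sku-123", ["trailing_4wk_demand", "promo_flag"]))
```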
Now, talking a bit about the success metrics, right?
How do we make sure that we understand the success of a forecasting
system or an architecture?
How is that fundamentally understood by the business?
The first thing, in my view, is that it goes beyond accuracy:
beyond MAPE, WAPE, RMSE, or any of those standard accuracy metrics.
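For reference, here are those standard metrics in a minimal numpy
sketch; the point of this section is that they are necessary but not
sufficient.

```python
# MAPE, WAPE, and RMSE on a toy forecast series.
import numpy as np

def mape(actual, forecast):
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((a - f) / a)) * 100)

def wape(actual, forecast):
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sum(np.abs(a - f)) / np.sum(np.abs(a)) * 100)

def rmse(actual, forecast):
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((a - f) ** 2)))

actual, forecast = [120, 95, 130, 110], [100, 100, 100, 100]
print(mape(actual, forecast), wape(actual, forecast), rmse(actual, forecast))
```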
The first one being business alignment.
How does the forecast enable anyone to make decisions or understand
the risks, the trade-offs, the constraints in the business?
Because a forecast can be a single number, but having context for
that number is really important.
Is it too high versus what happened last week?
Is it too low versus what happened last year at the same time?
Having that perspective is really important, and that is where
business alignment comes into place: the forecast is there, but
does it align with the business's goals?
Is it growing at the rate at which the business should be growing,
at a weekly, monthly, or yearly growth-rate level?
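A toy version of that alignment check might compare the forecast's
implied growth against the plan; the function name, thresholds, and
numbers below are hypothetical.

```python
# Flag a forecast whose implied growth drifts too far from the plan.
def alignment_flags(forecast, last_year_same_period, planned_growth_pct,
                    tolerance_pct=5.0):
    implied_growth = (forecast / last_year_same_period - 1) * 100
    gap = implied_growth - planned_growth_pct
    return {
        "implied_growth_pct": round(implied_growth, 1),
        "gap_vs_plan_pct": round(gap, 1),
        "needs_review": abs(gap) > tolerance_pct,
    }

# 100k units forecast vs. 88k last year, against a 10% growth plan.
print(alignment_flags(100_000, 88_000, planned_growth_pct=10.0))
```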
The second big item is operational efficiency.
One of the key things I've seen in my personal experience is that
some really good models have really bad deployments, which ends up
being perceived by the business as not a really good model.
The science might be really amazing, but guess what?
Being able to deploy it and create recurring, really solid outputs
every single week or month is really foundational to the system.
Hence, the ability to pick the right model and then deploy it for
the business is really important, and that is what gets encompassed
in operational efficiency.
And lastly: is the model consistently creating an output?
Guess what?
One day it created an output; the next day it did not.
It's just not consistent and reliable, no matter how good the
science behind it is.
And that draws directly into the reliability and consistency part.
It's not just the performance of the model on a week-to-week basis,
but does it perform really well in a highly stressed, complex
environment?
Businesses go through peak periods and non-peak periods.
Can the model and the architecture handle those periods
consistently, week over week, day over day, throughout the year
without any issues?
There can be bottlenecks or exceptions where it may not perform at
the same level as at a different time of the year, but being able to
understand why and how, and to plan for it ahead of time, is a key
part of this whole architecture.
Then user satisfaction: in the end, the customer's word is
everything, right?
Having that customer obsession to understand: is the forecasting
platform being used on a regular basis?
Is it being deployed?
Is it helping make really good decisions at the end of the day?
That is the final stamp that confirms whether the forecasting
process is successful or not, in a nutshell.
Next, I want to talk a little bit about asymmetric error handling.
This is a really important topic, and the reason I say that is that
oftentimes in businesses, throughout my experience, I've seen that
businesses index on a single value, a single number, as a forecast,
right?
The forecast next week is going to be, let's say, a hundred thousand
units sold.
But that single data point may not tell the whole story.
Oftentimes businesses have different risk appetites, and
over-forecasting might create less of an issue than
under-forecasting, or vice versa.
That is an equally important data point, or strategy, for the
business to understand as the single-number forecast and its
accuracy.
To give some really good examples, the first one I wanted to list
here is inventory management.
For certain businesses, stockouts are a much bigger problem than
having excess inventory: inventory costs might be very low and
stockouts might be very expensive.
Maybe having excess inventory is an okay thing in some situations,
to make sure the service level is really high, because when the
service level is low, guess what?
You lose customers.
Or it's the other way around: excess inventory might be a much
bigger issue than stockouts, where having constrained inventory
doesn't really create as much business loss as we would expect.
So being able to understand which one is more important for the
business, and which one is not, is a very important input.
Similarly, infrastructure capacity.
This is a classic example of: should we go and invest more money in
building more capacity to serve our customers?
Guess what?
When we invest, it's really expensive, and if the utilization does
not come through, you're sitting on a lot of depreciation cost.
Similarly, if we don't invest, guess what?
There's service degradation.
Is that okay for our customers?
Being able to quantify and understand which one is more important is
extremely important.
Something similar happens in financial forecasting, with regulatory
requirements and risk management protocols.
Should the business go very heavy on regulatory compliance, which
requires a lot of manpower and time investment, versus moving fast,
where the business can choose to be compliant at a bare minimum and
keep moving fast to grow, or being excessively indexed on meeting
the financial goals?
And the punchline: effective asymmetric error handling requires loss
function design that reflects actual business consequences rather
than mathematical convenience, and having a platform architecture
that understands and takes this feedback is really important.
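As a minimal sketch of that punchline, here is a loss where an
under-forecast (a stockout risk) costs three times an over-forecast;
the cost ratio is a hypothetical business input. With these weights
it is proportional to a pinball (quantile) loss at the 0.75
quantile, i.e. under_cost / (under_cost + over_cost).

```python
# Asymmetric loss: penalize under-forecasting harder than
# over-forecasting, per the business's actual cost structure.
import numpy as np

def asymmetric_loss(actual, forecast, under_cost=3.0, over_cost=1.0):
    error = np.asarray(actual, float) - np.asarray(forecast, float)
    return float(np.mean(np.where(error > 0,             # actual > forecast
                                  under_cost * error,    # under-forecast
                                  over_cost * -error)))  # over-forecast

print(asymmetric_loss([110], [100]))  # under by 10 units -> 30.0
print(asymmetric_loss([100], [110]))  # over by 10 units  -> 10.0
```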
Next, the microservices architecture of ML systems.
Decomposing the monolith of a forecasting system into focused
microservices provides numerous advantages, including improved
scalability, enhanced maintainability, and greater deployment
flexibility.
The four big buckets I want to talk about with respect to these
microservices are, first, data ingestion services: being able to
handle diverse sources with varying frequencies, whether something
updates daily or hourly, while at the same time maintaining a really
high quality of data.
The second one is feature engineering services.
Again, being able to transform the raw data into something that the
model can consume, understand, and interpret, and then create a
final output from, is another foundational element.
Third, model training services.
Every model requires some training, some assistance, in the short
term or even in the long term, and being able to do that in a
flexible, controllable, and well-understood manner is a key step in
making the whole model, the forecasting process, and the whole
architecture successful, right?
So it's a very important pillar.
And lastly, prediction serving services.
Providing real-time access to model predictions through scalable,
low-latency interfaces enables our customers to consume the data and
reduces friction in terms of them understanding: hey, what is the
forecasting output?
How do we use it?
Where can it be immediately synthesized for day-to-day decision
making?
And that's really important for the success of our customers.
So having really solid data pipelines, through dashboards and UI/UX
interfaces, for the customer to consume that information and deploy
it immediately is another key step.
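One minimal way to picture a low-latency serving layer is a
precomputed snapshot kept in memory and read as a cheap lookup; the
class and naming below are assumptions, not the actual design from
the talk.

```python
# Sketch of a prediction server: batch jobs publish a forecast
# snapshot, reads are in-memory lookups. All names are illustrative.
class PredictionServer:
    def __init__(self) -> None:
        self._cache: dict[str, list[float]] = {}
        self._version = "none"

    def refresh(self, predictions: dict[str, list[float]], version: str) -> None:
        # Swap in the newest published snapshot in one assignment.
        self._cache, self._version = predictions, version

    def get(self, sku: str) -> dict:
        return {"sku": sku, "version": self._version,
                "forecast": self._cache.get(sku, [])}

server = PredictionServer()
server.refresh({"sku-123": [105.0, 98.0, 112.0, 101.0]}, version="2024-W21")
print(server.get("sku-123"))
```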
Next, I want to talk about containerization and orchestration
strategies.
Again, this is really important when we are building really complex
systems, because if we don't containerize, and understand, or I
should say design, how these building blocks interact with each
other, the system gets really complicated and of course becomes a
roadblock in the long term to building scalable solutions.
There are some key considerations when containerizing these massive
architectures that we build.
One is fundamentally understanding how these containers are going to
handle the machine learning workloads.
Every ML model has a different workload requirement, and every
architecture or use case has a different workload requirement, and
being able to understand and containerize based on that requirement
is, I would say, one of the first design steps that needs to be
considered when thinking through the whole architecture.
Of course, there are other elements to it.
Horizontal pod autoscaling for demand patterns has become one of the
latest go-to methodologies in recent times.
Service mesh technologies for networking and security are another.
Persistent storage management for large data sets is particularly
important when we look at a longer implementation horizon, like four
to five years: the data set is going to grow.
How are we going to handle it?
What are the policies for keeping and storing data?
How do our customers want to read and consume this data on a
day-to-day basis?
And lastly, batch processing capabilities for periodic tasks.
You may not want to query a 20-million- or 20-billion-row table
every single day.
Maybe the customer only needs the latest and greatest, let's say the
version of the forecast that we published only yesterday.
Maybe we can do batch processing to manage the full workload of this
whole architecture much better, by segmenting and understanding what
is needed by the customer on a day-to-day basis versus at a lesser
frequency.
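A tiny sketch of that batching idea, using an in-memory SQLite
database for illustration: a periodic job materializes only the
latest published version into a small serving table, so consumers
never scan the full history. Table and column names are made up.

```python
# Materialize the latest forecast version into a serving table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forecast_history (sku TEXT, version TEXT, week TEXT, units REAL);
    INSERT INTO forecast_history VALUES
        ('sku-123', '2024-W20', '2024-W21', 95.0),
        ('sku-123', '2024-W21', '2024-W21', 102.0);
    -- The periodic batch step: keep only the newest version.
    CREATE TABLE forecast_latest AS
        SELECT * FROM forecast_history
        WHERE version = (SELECT MAX(version) FROM forecast_history);
""")
print(conn.execute("SELECT * FROM forecast_latest").fetchall())
```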
And again, each system has its unique requirements.
Being able to understand and document them, and making sure each of
those criteria and considerations is understood and is part of the
design process, becomes really helpful, particularly in the long
term, for that system to scale.
Lastly, API design best practices for ML systems.
Really important again.
RESTful API design principles provide a solid foundation, but they
must be adapted to handle the unique characteristics of prediction
services.
Resource modeling for predictions must balance simplicity with
expressiveness: simplicity being really important for the majority
of the architecture, with a touch of only the complexity that is
foundationally needed and cannot be absent.
The second one is request and response schema design, which must be
flexible enough to accommodate diverse input formats while providing
sufficient structure for validation and documentation.
Authentication and authorization are really important: we don't give
full access to every single user.
Access should be based on each individual profile, and being able to
manage that makes it much easier for the architecture to handle
multiple users, particularly as you scale.
Not all users will have complete access: the majority of users will
have very specific access, while others will have higher-level
access in the hierarchy so that they can change things and evolve
them much better.
API versioning strategies we discussed previously.
Versioning is really important, particularly as we scale, so that we
have really good backward compatibility in the system and we don't
get stuck on long-term scalability.
Now on to the fun part: real-world implementation success stories.
I'll talk about three: e-commerce, global logistics, and a financial
services organization.
I think these are three that I have been very close to personally,
and I can really vouch for how these platform-based architectures
have helped businesses.
The first one, e-commerce.
Working with Amazon, I've learned that e-commerce grows rapidly and
the use cases for forecasting systems are very unique.
Forecasting the number of pens that we would sell throughout the
year is very different from forecasting the number of phones, or
chairs, or couches that we will sell in a year.
Because it is so different across these specific use cases, we
deploy different machine learning methodologies across each of these
SKUs, which in the end requires a very high-compute, highly complex
architecture to manage.
In the end, the data sources are different, the ML models are
different, the end users are different, and hence the
containerization I mentioned above, and the microservices I
mentioned above, become really important, because we can build these
systems like Lego blocks, where certain blocks may not be important
for one type of forecasting use case but might be very useful for
others.
And being able to do that has been one of the most interesting
challenges that we have overcome, right?
Global logistics, similarly: every leg of logistics has a very
different customer base and use case.
A first mile is different from a last mile or a middle mile, and
being able to meet and fulfill those specific use cases of each leg
requires platform-based forecasting, because it builds a really
solid foundation to meet different requirements using very similar
Lego blocks with a solid architecture.
Financial services organizations: again, really important, because
regulatory compliance is very important now in any business.
Being able to forecast and understand where risks exist, be it
credit risk, or market forecasting, or the sales forecasting that we
use to understand long-term cash flows, et cetera, and having really
good interpretability in these models and bias detection mechanisms,
enables very successful outcomes.
It benefits any business, because the financial systems will tell a
much truer picture of the future state of the organization than
anything else.
Lastly, future directions and conclusion.
The key emerging trends that we have seen from our point of view:
AutoML technologies have started to democratize.
A lot of the models that we see today are available in almost any
system; the majority of them are available in most Python libraries
or even AWS platforms, which can be leveraged by almost anyone at a
very low cost.
Then there's federated learning, where people are collaborating
together to build singular models that are successful across
multiple businesses.
And then real-time machine learning with continuous adaptation,
where the model evolves every single day, every single hour, I would
say.
And then, of course, sustainable computing practices are another key
part of it.
The key takeaway, I think the punchline, is that successful
forecasting platforms balance technical sophistication with
operational pragmatism, right?
It's about building a very complex system while also being able to
deploy it successfully for the business to make really good
decisions.
It's not just tied to accuracy, and accuracy should not be used as a
bottleneck to deploying complex, stable, really good systems that
enable organizations to make solid decisions at the end of the day.
Yeah.
Awesome.
Thank you.
That's all that I had.
I know I had very little time to cover a really complex topic, but
if you have any questions, feel free to reach out to me on LinkedIn
and I would be happy to discuss more with you and share more of what
I know.
Thank you.