Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Platform-First Forecasting: Engineering Scalable ML Systems That Drive Business Value

Abstract

Stop building isolated ML models! Learn how Amazon's platform engineering transformed forecasting into an $80MM revenue driver. Discover microservices architecture, containerization & API-first design patterns that enable strategic forecast consumption & auto-scaling. Build ML platforms that scale now!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, hi everyone. I'm Shiv, and I'm here to present a very interesting topic on forecasting: platform-first forecasting. We're going to talk about engineering scalable machine learning systems that drive real business value. For the next 15 minutes, I want to discuss how to build robust, scalable forecasting systems that add consistent business value. This goes beyond focusing narrowly on accuracy and looks at the architecture of the forecasting system as a whole. A little bit about myself before we dive into the details: I have almost 13 years of work experience across multiple industries in planning, supply chain, S&OP, and forecasting. I've worked with Caterpillar and Dell, and right now I'm with Amazon, where I lead a team of demand planners, a forecasting team, building something very similar to what I'm going to talk about here: highly scalable, machine-learning-based forecasting systems that support Amazon's rapid growth. At a high level, I'll be covering five agenda items today. First, the platform engineering revolution that is happening right now. Second, the architectural foundations it is built on. Third, why the success criteria for a good forecast go beyond accuracy as a metric. Fourth, implementation strategies: how we put this whole architecture to use for the business. And finally, some real-world success stories where this is already being used. Awesome. Let's rethink machine-learning-based forecasting through a platform lens. It's really important to understand that the machine learning landscape has undergone a major shift.
It's not just about the science, the machine learning model itself. It's about all the elements that enable that model to be successful: the tuning, the platform it's built on, the architecture that supports its deployment and usability, the pipeline of data that feeds it. Platform-first forecasting represents a holistic approach to machine learning system design, and it fundamentally prioritizes three things. Operational excellence: how the whole architecture is actually put to use. Scalability with a multi-year vision: building a system that does not require complete re-engineering every one or two years, but has a solid five-to-ten-year roadmap. And of course business alignment: how the business teams are eventually going to use it. A key insight I would love to add based on my personal experience: brilliant models sitting in research notebooks provide almost zero business value, while moderately accurate models deployed well through scalable systems add far more value than very sophisticated models that live in research papers all their lives. So it's very important to learn how to deploy these complex models for the business to use and derive real value from them.

The platform engineering revolution in machine learning has happened, in my view, through three main steps. First, abstraction without oversimplification. Effective platforms hide the complexity of the models and allow the user to interact with them through simple, easy parameters, and I think that is an important success criterion for this whole architecture. Second, progressive disclosure of complexity. It's really important that users of these architectures and models are not overwhelmed with information and complexity, because the moment that happens, it hampers deployment and adoption. We have to assume users have minimal technical expertise and create a clear glide path to onboard them and expose the complexity over time. And lastly, API-first design. APIs evolve over time, so all platform interactions need clear, versioned APIs. That enables loose coupling between the forecasting systems and their consumers, and it helps long-term scalability as well.

Now, the critical architectural components. Four key components enable this vision of platform-first forecasting ML systems. First is data, of course. Data is the bread and butter of any machine learning model. Solid data pipelines and clean, usable data feeds are fundamentally the backbone, so having a solid data architecture is the very first step in making this successful. Number two is stream processing. Technologies like Apache Kafka enable processing the data we feed into these models at very high velocity, and having something that can absorb and understand this data is super important. Third, security controls, again a foundational element. Data privacy and intellectual property protection are the foundation of any system that consumes and manages a ton of data, and they are a core building block of the architecture we are discussing. Lastly, feature stores. Every machine learning system is unique in its own way. Understanding which features need to be built for each specific use case and each specific business is really important.
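To make the feature-store idea concrete, here is a minimal pure-Python sketch, not any particular product's API; every name here (FeatureStore, register, materialize, and the demand features) is invented purely for illustration:

```python
from datetime import datetime, timezone

class FeatureStore:
    """Minimal in-memory feature store sketch: register named feature
    transformations once, then compute and serve them consistently for
    both training and prediction paths."""

    def __init__(self):
        self._transforms = {}  # feature name -> fn(raw_record) -> value
        self._values = {}      # (entity_id, feature name) -> (value, timestamp)

    def register(self, name, fn):
        """Declare a feature once, so training and serving share one definition."""
        self._transforms[name] = fn

    def materialize(self, entity_id, raw_record):
        """Compute and store all registered features for one entity."""
        now = datetime.now(timezone.utc)
        for name, fn in self._transforms.items():
            self._values[(entity_id, name)] = (fn(raw_record), now)

    def get(self, entity_id, names):
        """Serve the latest stored values for the requested features."""
        return {n: self._values[(entity_id, n)][0] for n in names}

# Illustrative demand-forecasting features for a hypothetical SKU.
store = FeatureStore()
store.register("units_7d_avg", lambda r: sum(r["daily_units"][-7:]) / 7)
store.register("is_peak_week", lambda r: r["week"] in r["peak_weeks"])
store.materialize("sku-123", {"daily_units": [10] * 14,
                              "week": 51, "peak_weeks": {47, 51}})
features = store.get("sku-123", ["units_7d_avg", "is_peak_week"])
```

The design point is the single registry of transformations: the same definition feeds both training pipelines and low-latency serving, which is exactly the consistency problem feature stores exist to solve.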
Hence, understanding where specialization is needed is very important, both for tracking and for future serving capabilities, so we know what to continuously evolve and what to maintain.

Now, a bit about success metrics. How do we understand the success of a forecasting system or architecture, and how is that understood by the business? In my view, it goes beyond accuracy, beyond the standard accuracy metrics like MAPE, WAPE, or RMSE. The first criterion is business alignment. How does the forecast enable anyone to make decisions or understand the risks, trade-offs, and constraints in the business? A forecast can be a single number, but having context around that number is really important. Is it too high compared to what happened last week? Is it too low versus the same time last year? Having that perspective is where business alignment comes into play: does the forecast align with the business's goals, is it growing at the rate the business should be growing, at a weekly, monthly, or yearly level? The second item is operational efficiency. One of the key things I've seen in my personal experience is that some really good models have really bad deployments, which ends up being perceived by the business as a bad model. The science might be amazing, but being able to deploy it and create solid, recurring outputs every single week or month is foundational to the system. The ability to pick the right model and then deploy it well for the business is what operational efficiency encompasses.

That draws directly into reliability and consistency. Is the model consistently creating output? If it created an output one day and not the next, it's simply not consistent or reliable, no matter how good the science behind it is. And it's not just week-to-week performance: does it perform well in a highly stressed, complex environment? Businesses go through peak and non-peak periods. Can the model and the architecture handle those periods consistently, week over week, day over day, throughout the year without issues? There can be bottlenecks or exceptions where it may not perform at the same level as at other times of the year, but understanding why and how, and planning for it ahead of time, is a key part of this whole architecture. And finally, user satisfaction. In the end, the customer's word is everything. Having that customer obsession to understand whether the forecasting platform is being used regularly, whether it is being deployed, and whether it is helping make good decisions is the final stamp that confirms whether the forecasting process is successful.

Next, I want to talk a little about asymmetric error handling. This is a really important topic, and the reason I say that is that throughout my experience I've seen businesses index on a single number as the forecast: next week we're going to sell, say, a hundred thousand units. But that single data point may not tell the whole story. Businesses have different risk appetites, and over-forecasting might create less of a problem than under-forecasting, or vice versa. Understanding that is just as important a data point for the business as the single-number forecast and its accuracy.
To give some good examples, the first one I want to list is inventory management. For certain businesses, stockouts are a much bigger problem than excess inventory: inventory costs might be low while stockouts are very expensive, so carrying extra inventory can be acceptable if it keeps the service level high, because when the service level drops, you lose customers. Or it can be the other way around: excess inventory might be a much bigger issue than stockouts, where constrained inventory doesn't create as much business loss as we might expect. Understanding which one matters more for the business is a very important input. Similarly, infrastructure capacity. This is a classic example: should we invest more money in building more capacity to serve our customers? Investing is really expensive, and if the utilization doesn't come through, you're sitting on a lot of depreciation cost. But if we don't invest, there's service degradation. Is that acceptable for our customers? Being able to quantify which side matters more is extremely important. Something similar happens in financial forecasting with regulatory requirements and risk management protocols: should the business go heavy on regulatory compliance, which requires a lot of manpower and time investment, or stay compliant at a bare minimum and move fast to grow the business, or index heavily on meeting financial goals? The punchline: effective asymmetric error handling requires loss function design that reflects actual business consequences rather than mathematical convenience, and a platform architecture that understands and incorporates that feedback.

Next, microservices architecture for ML systems. Decomposing the monolithic forecasting system into focused microservices provides numerous advantages, including improved scalability, enhanced maintainability, and greater deployment flexibility. There are four big buckets I want to talk about with respect to these microservices. First, data ingestion services: handling diverse sources with varying update frequencies, daily or hourly, while maintaining high data quality. Second, feature engineering services: transforming raw data into something the model can consume, interpret, and turn into a final output, another foundational element. Third, model training services: every model requires training and tuning, in the short term and the long term, and being able to do that in a flexible, controllable, well-understood manner is a key step in making the whole forecasting process and architecture successful. And lastly, prediction serving services: providing real-time access to model predictions through scalable, low-latency interfaces lets our customers consume the data and reduces friction in understanding what the forecasting output is, how to use it, and where it can immediately be synthesized into day-to-day decision making. Having solid data pipelines feeding dashboards and UI/UX interfaces, so the customer can consume that information and act on it immediately, is another key step.
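The talk doesn't prescribe a particular loss function, but one common way to encode this kind of business asymmetry is the pinball (quantile) loss; the sketch below, with purely illustrative cost numbers, shows how the ratio of under- to over-forecast cost picks which quantile to forecast:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Asymmetric (pinball/quantile) loss.

    q near 1.0 penalizes under-forecasting more (stockout-averse);
    q near 0.0 penalizes over-forecasting more (excess-averse)."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Choosing q from business costs (newsvendor-style logic):
#   q* = cost_under / (cost_under + cost_over)
cost_under, cost_over = 9.0, 1.0  # illustrative: stockouts 9x as costly as excess
q_star = cost_under / (cost_under + cost_over)

# At q = 0.9, missing low by 10 units hurts 9x more than missing high by 10:
under_penalty = pinball_loss(np.array([100.0]), np.array([90.0]), q_star)
over_penalty = pinball_loss(np.array([100.0]), np.array([110.0]), q_star)
```

Training a model to minimize this loss (rather than symmetric RMSE) makes the forecast itself carry the business's risk appetite, which is the point of the inventory and capacity examples above.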
Next, I want to talk about containerization and orchestration strategies. Again, this is really important when building complex systems, because if we don't containerize, if we don't design how the building blocks interact with each other, the system gets really complicated and becomes a long-term roadblock to building scalable solutions. There are some key considerations when containerizing these large architectures. One is fundamentally understanding how the containers will handle machine learning workloads. Every ML model has a different workload requirement, every architecture and use case has a different workload requirement, and containerizing based on those requirements is one of the first design steps to think through for the whole architecture. There are other elements too. Horizontal pod autoscaling for demand patterns has become one of the go-to approaches I've seen recently. Service mesh technologies for networking and security are another. Persistent storage management for large datasets is particularly important over a longer implementation horizon, say four to five years: the dataset is going to grow, so how do we handle it? What are the policies for keeping and storing data? How do our customers want to read and consume this data day to day? And lastly, batch processing capabilities for periodic tasks. You may not want to query a 20-million or 20-billion-row table every single day; maybe the customer only needs the latest and greatest, say the version of the forecast we published just yesterday. Batch processing lets us manage the full workload of this whole architecture much better by segmenting what the customer needs daily versus at lower frequency. Again, each system has its unique requirements, and understanding, documenting, and incorporating each of those considerations into the design process becomes really helpful in the long term for that system to scale.

Lastly, API design best practices for ML systems. RESTful API design principles provide a solid foundation, but they must be adapted to handle the unique characteristics of prediction services. Resource modeling for predictions must balance simplicity with expressiveness: simplicity for the majority of the architecture, plus only the complexity that is fundamentally needed. Second, request and response schema design must be flexible enough to accommodate diverse input formats while providing sufficient structure for validation and documentation. Authentication and authorization are also really important: we shouldn't give full access to every single user. Access should be based on each individual profile, which makes it much easier for the architecture to manage multiple users, particularly as you scale. Most users will have very specific access, while a few higher in the hierarchy will have broader access so they can change and evolve things.
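As a toy, framework-free illustration of versioned prediction APIs with backward compatibility, here is a sketch (the routes, payload schema, and quantile arithmetic are all invented for illustration) in which v2 adds fields without breaking v1 consumers:

```python
import json

# Hypothetical versioned prediction endpoint: v1 returns a single point
# forecast; v2 returns a superset of v1, adding illustrative quantiles.
def predict_v1(payload):
    history = payload["history"]
    return {"forecast": sum(history) / len(history)}  # naive mean forecast

def predict_v2(payload):
    out = predict_v1(payload)  # backward compatible: v1 fields unchanged
    out["quantiles"] = {"p10": out["forecast"] * 0.8,
                        "p90": out["forecast"] * 1.2}
    return out

# Version lives in the path, so old clients keep working as v2 ships.
ROUTES = {
    ("POST", "/v1/forecast"): predict_v1,
    ("POST", "/v2/forecast"): predict_v2,
}

def handle(method, path, body):
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, json.dumps({"error": "unknown route"})
    return 200, json.dumps(handler(json.loads(body)))

status, resp = handle("POST", "/v2/forecast",
                      json.dumps({"history": [90, 100, 110]}))
```

The design choice worth noting is that v2 only ever adds fields; removing or renaming a v1 field would force every consumer to migrate at once, which is exactly the coupling the versioning strategy is meant to avoid.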
API versioning strategies: we touched on this previously. Versioning is really important, particularly as we scale, so that we have good backward compatibility and don't get stuck on long-term scalability.

Now for the fun part: real-world implementation success stories. I'll talk about three: e-commerce, global logistics, and a financial services organization. These are the three I have been closest to personally, and I can really vouch for how these platform-based architectures have helped businesses. The first is e-commerce. Working with Amazon, I've learned that e-commerce grows rapidly and its forecasting use cases are very unique. Forecasting the number of pens we'll sell throughout the year is very different from forecasting the number of phones, or chairs, or couches. Because it's so different across these specific use cases, we deploy different machine learning methodologies across SKUs, which in the end requires very high compute and a highly complex architecture to manage, because the data sources are different, the ML models are different, and the end users are different. That's why the containerization and the microservices I mentioned above become really important: we can build these systems like Lego blocks, where certain blocks may not matter for one type of forecasting use case but are very useful for others. Being able to do that has been one of the most interesting challenges we have overcome. Global logistics is similar: every leg of logistics has a very different customer base and use case. First mile is different from last mile or middle mile. Meeting and fulfilling those specific use cases for each leg requires platform-based forecasting, because it builds a solid foundation for meeting different requirements using very similar Lego blocks on a solid architecture. Financial services organizations: again, regulatory compliance is really important in any business now, and being able to forecast and understand where risks exist, whether credit risk, market forecasting, or the sales forecasting we use to understand long-term cash flows, matters a great deal. Having good interpretability in these models, along with bias detection mechanisms, enables very real financial benefits, because the financial systems paint a much truer picture of the future state of the organization than almost anything else.

Lastly, future directions and conclusion. The key emerging trends from our point of view: AutoML technologies have started to democratize. A lot of the models we see today are available in almost any system; the majority are available in most Python libraries or on AWS platforms, and can be leveraged by almost anyone at very low cost. Then there's federated learning, where people collaborate to build shared models that are successful across multiple businesses. Then real-time machine learning with continuous adaptation, where the model evolves every single day, even every hour. And of course, sustainable, green computing practices are another key part of it. The key takeaway, the punchline: successful forecasting platforms balance technical sophistication with operational pragmatism. It's about building a complex system while also being able to deploy it successfully, so the business can make really good decisions.
And it's not tied only to accuracy; accuracy should not be used as a bottleneck to deploying complex, stable, really good systems that enable organizations to make solid decisions at the end of the day. Awesome, thank you. That's all I had. I know I had very little time to cover a really complex topic, but if you have any questions, feel free to reach out to me on LinkedIn and I would be happy to discuss more with you and share more of what I know. Thank you.

Shivendra Kumar

Software Dev Engineer @ Amazon



