Platform-First Cloud Transformation: Engineering Scalable Infrastructure for Enterprise Success

Abstract

Unlock platform engineering secrets from Fortune 500 cloud transformations! 40% faster deployments, 60% less overhead, 3x developer satisfaction. Battle-tested blueprints you can implement immediately. Stop cloud chaos, start platform excellence.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello everyone. This is Srikanth. I work as a software engineer at Meta. Today I'm going to talk about engineering scalable infrastructure. Let's start with the challenges the current software engineering ecosystem is facing in an increasingly complex digital landscape. Cloud spending is projected to grow to $678 billion by 2024. In this context, platform engineering is extremely important for cloud adoption and infrastructure management.
So what matters when we talk about the platform engineering revolution? First, faster deployments. We talk about microservice architectures, cloud deployments at scale, and so on, but developer productivity and engineering excellence almost always depend on how fast we can deploy to production. The architectural constructs covered here can make deployments about 40% faster. They also reduce overhead: the processes, manual steps, and operational work that are otherwise necessary. At the same time, they increase developer satisfaction. When developers can deploy code, run it, and release features much faster, satisfaction goes up significantly. These are incremental gains across the stack, and an amazing opportunity to transform any development organization for the evolving landscape of deployments.
Let's talk about the platform-first approach. What does it actually mean? Looking at the current challenge, there is a large gap between what developers want to build and what they need to know to build it.
Traditional DevOps models often burden developers with infrastructure concerns, creating cognitive overload and slowing innovation. So what are we trying to achieve?
First, self-service infrastructure. Engineers should be able to provision their own resources and VMs, run their workloads at scale, and manage configurations, mostly self-serve, without going through ticketing systems or other complex processes, and with much less toil.
Second, golden paths: predefined, optimized pathways that guide developers through common scenarios. They carry the best practices for how we develop code and how we get the infrastructure needed to develop, deploy, and test it.
Third, central governance. The more freedom we give, the more important it is to have enough controls and checks in place. It's a defensive mechanism. We want to follow the least-privilege principle: whoever needs access can have it, but there has to be a balance between flexibility and making sure security, privacy, and compliance are in place.
Now let's talk about architectural patterns for multi-cloud success. Many companies are moving away from single-cloud environments. Every cloud has its own strengths, so a lot of companies are evolving toward a multi-cloud architecture. Looking at the patterns currently used in the industry, one that we usually see is the hub-and-spoke model.
In the hub-and-spoke model, a central platform hub manages core services like authorization, monitoring, and governance, the foundational components needed to build any strong software stack. It becomes a common hub, and you build services on top of it.
The next question is what those services need in order to communicate. There are platforms already built for service integration, like Istio. In multi-service or microservice environments these become extremely important. Developers don't have to think about how orchestration between systems happens, and existing tools help accelerate development a lot. A service mesh like Istio in a Kubernetes environment provides consistent networking and inter-service communication.
The other important piece is policy-as-code frameworks. Governance rules are codified and automatically enforced, so compliance happens without manual intervention, which makes it much easier to run systems at scale. These are, at a very high level, the architectural patterns I have seen to be important in any cloud environment.
Next, let's talk about how we optimize developer experience in a world where development is changing. How we code is changing significantly. It's extremely important to give developers the freedom to build faster at scale, and to test and deploy at scale. There are a few things I would touch on here, starting with intelligent AI coding tools.
Right now, a lot of generative AI tools are extremely useful across the industry. We have something like GitHub Copilot, and the other LLMs have similar tools as well. They have proven useful for generating boilerplate code, writing test cases for code we have already written, and providing insights into how to fix bugs. These tools keep getting better, and as they do, developer productivity increases significantly.
The next important thing is CI/CD. We need a strong CI/CD system, but what does that mean? It means automated testing pipelines. Whenever a developer pushes code, the pipeline enforces unit tests on the core components, plus integration tests and checks on the deployment itself. Having those checks and balances really helps, and at the same time makes sure that quality code is getting into production.
The other important aspect is self-healing. Things can go wrong at any time, so how do we build systems that heal themselves when something goes bad? It can be as simple as a canary deployment for each new build. If the new build doesn't work well, the system self-heals and goes back to the previous healthy build. That's one example of how infrastructure and deployment pipelines can improve the developer experience.
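As an illustration of the self-healing rollback described above, here is a minimal Python sketch. It is not the speaker's pipeline: the function name `evaluate_canary`, the `CanaryResult` type, and the 1% error-rate tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    promoted: bool      # True if the canary build became the active build
    active_build: str   # build that should serve traffic after the check
    reason: str


def evaluate_canary(stable_build, canary_build, canary_error_rate,
                    baseline_error_rate, max_regression=0.01):
    """Promote the canary, or fall back to the previous healthy build.

    max_regression is a hypothetical tolerance: how much worse than the
    baseline the canary's error rate may be before we roll back.
    """
    if canary_error_rate > baseline_error_rate + max_regression:
        # Self-heal: keep serving the previously healthy build.
        return CanaryResult(False, stable_build,
                            "error rate regressed; rolled back")
    return CanaryResult(True, canary_build, "canary healthy; promoted")


if __name__ == "__main__":
    print(evaluate_canary("build-41", "build-42", 0.08, 0.02))
```

A real pipeline would usually compare several signals (latency, saturation, business metrics) over a time window rather than a single error rate.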
At the same time, we keep stable software running in production without manual intervention. So we want to build strong developer confidence. The thing developers are best at is writing code, so giving them the opportunity to build strong code, along with the tools and infrastructure to accelerate and support that, is the best thing we can provide in any cloud environment.
The next thing is infrastructure as code. We have taken care of the developer experience: developers can build, deploy, and test the code. Now we also need to look at where it gets deployed. It goes onto a VM, a Kubernetes environment, or some other construct, like a microservice in a cloud environment. So now we are dealing with the underlying resources: the VMs, and how we deploy a microservice, whether as Docker containers or as pods in Kubernetes. How do we package them? How do we deploy them? These are extremely important questions for at-scale deployment.
One thing I want to touch on here is modular template libraries. These are for building reusable components, and there are various aspects to them. Autoscaling the infrastructure is one part of it.
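To make the autoscaling idea concrete, here is a small Python sketch of a proportional scaling rule. It mirrors the replica formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current * currentMetric / targetMetric)), but the function name `desired_replicas` and the default bounds are assumptions for illustration, not part of the talk.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization, min_replicas=1, max_replicas=10):
    """Return how many replicas to run for the observed utilization.

    Utilization values are percentages (for example average CPU).
    min_replicas and max_replicas are hypothetical guardrails that a
    template could expose as parameters.
    """
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the configured bounds so scaling never runs away.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% utilization with a 60% target.
    print(desired_replicas(4, 90, 60))
```

Keeping the upper bound explicit is also a cost control: the platform can scale out automatically, but only up to a limit someone has agreed to pay for.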
If we exceed some limit of the infrastructure, templates can kick in and expand the service. For example, if our workloads need more than what we have currently purchased from the cloud provider, we can have controls and checks that expand the infrastructure so more compute is available to run them.
Similarly, GitOps workflows play an extremely important role. They provide an audit trail of what went out and what was rolled back.
After providing all these at-scale components, we also have to look at compliance scanning. We need to make sure that whatever libraries we build into the infrastructure deployment are security tested. There are tools that help with this. Even for something as simple as Docker containers in a Kubernetes environment, automated tools can scan the containers, make sure none of the included libraries has known vulnerabilities, and confirm that they are compliant and secure before they go to production.
So first we talked about developers and how we can make them more agile and faster. Then we talked about infrastructure and how to build it to be robust and compliant. With those things sorted out, the next question is how we know that what we built is successful. There are various metrics we can look at, starting with technical metrics: how fast are we deploying?
The technical metrics include the mean lead time for changes; the mean time to recovery, meaning how fast we can recover when things go bad; and the failure rates. These are all important technical metrics that give us insight into how fast we are deploying code, how fast we are reacting to change, and how fast we can roll back so that customers see less downtime.
The other thing we need to think about is business metrics. Having looked at the technical side, we need to look at the business side: how much we are spending, the cost per deployment, utilization rates, and how much time it takes us to go to market. In the past, when there was little or no cloud, a lot of companies built their own data centers. Scaling the infrastructure was relatively difficult, because hardware had to be procured and provisioned before we could start using it for anything. In the cloud, look at how fast and how agile we can be in adapting to change. If I have to deploy a new feature or a new product, how fast can we get it to market? Or say I need to release a product in another country. It becomes much easier, because I don't have to buy hardware in a different country. How we are progressing on this, and how easy it is, are important metrics as well.
The other thing is developer metrics. By now we have built robust systems.
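Returning to the technical metrics named above (lead time for changes, time to recovery, failure rate), the following Python sketch shows one way they could be computed from deployment records. The `Deployment` record shape and the `summarize` function are hypothetical; real data would come from the CI/CD system and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored


def summarize(deployments):
    """Compute mean lead time, change failure rate, and mean time to recovery."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at
                  for d in failures if d.restored_at is not None]
    return {
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery": (sum(recoveries, timedelta()) / len(recoveries)
                                  if recoveries else None),
    }


if __name__ == "__main__":
    day = datetime(2025, 9, 4)
    records = [
        Deployment(day.replace(hour=9), day.replace(hour=11)),
        Deployment(day.replace(hour=9), day.replace(hour=13), failed=True,
                   restored_at=day.replace(hour=13, minute=30)),
    ]
    print(summarize(records))
```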
How do we know whether developers are really enjoying it, or whether it's becoming a painful work experience? Some of the things we can do are run surveys, gather feedback, find the pain points, and improve on them. We have to keep improving our product. We know at a high level what good things these cloud systems can bring, but every system is different, and we have to adapt to the surveys and feedback and make it better over time.
Now let's touch on the emerging trends in cloud infrastructure and platform engineering. There are a few things I want to mention here.
One is AI-powered operations with machine learning models. AI is everywhere and is disrupting a lot of industries, and in my opinion this is an area where AI can play a big role. AI can predict resource needs based on existing load and other signals. It can anticipate future needs, optimize cost, and provide insights and suggestions on how to improve our cost models. I feel AI can bring significant improvements to cloud platform engineering.
Another important aspect is zero-trust security. Security should be an essential part of cloud platforms. It becomes the number one priority when we build a microservices framework in a cloud environment, and one of the important constructs is zero trust. With zero trust, we start from the assumption that nothing is trusted by default.
The next one is edge computing support. Platforms are now running on the edge as well, for IoT and real-time workloads. I think this is all picking up really well.
With increasing compute speed, and the capability to run over faster networks, I think IoT and real-time processing also have a significant opportunity here.
Now let's talk about how we implement these strategies. We can't really go from A to B directly; there has to be some kind of evolution. In my opinion, we start with a phased approach. You start with pilot teams: a few small teams that test things out, build the base infrastructure, and help other teams adapt to the new framework. Whether it's a microservices framework or a Kubernetes platform, this is the opportunity to build some core base constructs. For example, how do we deploy things? There are various options, like canary deployments or blue-green deployments, and that is a construct you can build into the platform. The next question is what a consistent way looks like for software teams to build their packages and deploy them, for example into a Kubernetes environment. Having that base layer of the platform built lets teams try out the framework, and gives engineers the opportunity to come in, contribute, and make it better. I feel this is a gradual process. We start with a few teams, pilot with them, identify what else we can improve, and progress from there.
Once we have enough confidence from the pilot teams, we start socializing the platform with broader, cross-functional teams, and thinking about how to make it a shared ownership model. This is a platform, so people can come in and contribute, improve it, and extend it.
Other teams can also extend it, and we end up with a big platform ecosystem within the company. And as I keep saying, it's a continuous evolution. Things can improve and extend, and we keep improving the platform based on feedback as we use it.
The next thing I want to talk about is the complexity gap. If you look at this slide, we have operational tooling for build and deploy, the knowledge and context a person needs to have, and end-to-end responsibility. This is just a high-level view of where they intersect. Developers need operational tooling to build and deploy, and they also need knowledge and context. Looking at the cognitive load on developers, the point in the middle where these merge becomes a collective responsibility. That is the sweet spot.
Now let's talk about multi-cloud governance. Once we have a strong foundation and have built all these good things, it is extremely important, as I mentioned before, to have security and compliance in place as well. There are two or three things to look out for. One is inconsistent security controls: without the right policies in place, things become extremely difficult. Another is cost management. When we move to the cloud there are a lot of offerings and tools, and cost is an important factor when we use them, so having a cost management strategy is also very important.
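One way to keep security and cost controls consistent across clouds is to express them as code and run the same checks everywhere. The Python sketch below is only an illustration of that idea: the `Resource` shape, the two rules, and the `cost-center` tag name are assumptions, not a real policy framework or anything shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    cloud: str          # e.g. "aws", "gcp", "azure"
    encrypted: bool
    tags: dict = field(default_factory=dict)


def require_encryption(resource):
    """Security control: every resource must be encrypted."""
    return None if resource.encrypted else "storage must be encrypted"


def require_cost_center(resource):
    """Cost control: every resource must be attributable to a cost center."""
    return None if "cost-center" in resource.tags else "missing cost-center tag"


POLICIES = [require_encryption, require_cost_center]


def evaluate(resources, policies=POLICIES):
    """Apply every policy to every resource, whichever cloud it lives in."""
    violations = {}
    for resource in resources:
        messages = []
        for policy in policies:
            message = policy(resource)
            if message is not None:
                messages.append(message)
        if messages:
            violations[resource.name] = messages
    return violations


if __name__ == "__main__":
    inventory = [
        Resource("orders-db", "aws", True, {"cost-center": "retail"}),
        Resource("logs-bucket", "gcp", False),
    ]
    print(evaluate(inventory))
```

Because the rules are plain functions, the same list can run in a CI pipeline before deployment and again on a schedule against live inventory.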
The other thing is compliance. Depending on the industry we are in, we need to make sure we are compliant, and having the right controls in place is very important.
The next important thing is self-service infrastructure. Why do we need it? If you look at the graph on the left, with the traditional approach, provisioning took a long time, compliance issues took a long time to resolve, and developers spent a long time waiting. We talked about some aspects of the traditional data center approach earlier. With platform engineering, we are building for the cloud, and what we have covered so far is how these can improve. Self-service infrastructure can dramatically reduce provisioning times, provide insight into compliance issues, and improve developer productivity over time.
Then I want to touch on golden paths. How do we balance control and freedom? With freedom comes a lot of responsibility, so it's extremely important to have the right controls in place so that everything runs according to the security and compliance guidelines. With the right controls, golden paths help with a few things. One is reducing cognitive load: developers can just build business logic without thinking about the infrastructure details. That accelerates development cycles and lets them focus on how they want to build and deploy. That's what they are good at, and we are giving them that opportunity. The other thing is consistent implementation.
Structured patterns that ensure security are very important, and so are flexible boundaries: providing guidance while allowing for customization. When there is a new requirement, a team can deviate from the standard pattern. So those are a few things that golden paths can provide.
What do we think about the future of cloud? I strongly believe it's more than a technical evolution. The pace at which it is improving is significant, and it has transformed a lot of industries by giving developers agility and a faster go-to-market. I think the path is very clear: robust platform engineering capabilities will definitely be the future, and building those foundational components across the stack is extremely important for success.
Those are the basic constructs I wanted to cover in this talk. Thank you very much, everyone. Have a nice day.

Conf42 Platform Engineering 2025 - Online

September 04 2025 - premiere 5PM GMT

Srikanth Vissarapu

Staff Software Engineer @ Crusoe
