Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Platform Engineering for Modern Data Infrastructure

Abstract

Stop wrestling with data platform complexity! Discover proven platform engineering patterns that turn infrastructure headaches into developer self-service wins. Real lessons from years of evolution: avoid vendor traps, scale efficiently, delight users.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Rahul Joshi. Thank you for attending my session. I'm going to talk about platform engineering for modern data infrastructure. We are going to see how data infrastructure has evolved over the last 15 to 20 years, what new technologies and capabilities have emerged, and how organizations are generating and consuming data at large scale. That has created a need for platform engineering: we look at these systems not as isolated platforms or components in the overall enterprise, but as integrated, self-service, large-scale, multi-tenant platforms with the right access controls, enabling all the business capabilities these platforms make possible.

So let's take a deep dive, starting with the evolution of data platform architecture itself. I'm showing three key phases on the timeline. First, the big data era: around 2006-2007, Hadoop emerged as an outcome of Google's GFS and MapReduce papers. That triggered a lot of new use cases where companies started to build big data lakes, ingest large volumes of data, and process it by maintaining their own complex clusters of commodity hardware. The principle behind Hadoop was that it ran on commodity hardware, and platform engineers focused primarily on keeping these systems operational, so a lot of heavy lifting was done by teams owned by the enterprises themselves.

Later, the cloud changed the infrastructure management side, and data platforms shifted toward managed services on AWS, Google Cloud, and Microsoft Azure. Software-as-a-service concepts emerged, with managed services like Redshift, BigQuery, and Snowflake. That abstracted a lot of operational complexity, but, as with any new technology, it introduced new challenges around cost optimization and vendor management. Once you are on the cloud you have practically infinite compute available, so you have to use it wisely: stay on top of the compute and activity you need while managing your cost efficiently.

Third is the lakehouse architecture, where technologies like Delta Lake, Iceberg, and Hudi are enabling platform engineers to build systems that support both analytical and operational workloads on the same platform, unifying storage with strong consistency and the right governance. Each architectural phase has contributed important lessons and has influenced how we design and operate data platforms today.
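To make the lakehouse idea concrete, here is a minimal sketch, assuming PySpark with the delta-spark package installed; the table path and the order data are illustrative only, not part of the talk. It shows a batch append and an operational-style MERGE landing on the same table that analytical readers query.

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Batch-style append: land raw order events into a Delta table (path is illustrative).
orders = spark.createDataFrame(
    [(1, "created", 120.0), (2, "created", 80.0)],
    ["order_id", "status", "amount"],
)
orders.write.format("delta").mode("append").save("/tmp/lakehouse/orders")

# Operational-style upsert: apply late-arriving status changes with MERGE,
# while analytical readers keep querying the same table with ACID guarantees.
updates = spark.createDataFrame([(2, "shipped", 80.0)],
                                ["order_id", "status", "amount"])
target = DeltaTable.forPath(spark, "/tmp/lakehouse/orders")
(target.alias("t")
 .merge(updates.alias("u"), "t.order_id = u.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

spark.read.format("delta").load("/tmp/lakehouse/orders").show()
```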
We are going to focus on this modern architecture and look at it through the lens of what platform engineers have to be aware of, and consider, while designing new platforms based on these architecture patterns.

So let's look at some of the core platform engineering principles. If you're building a data platform at enterprise scale, you need abstraction and interface design. You have different types of personas: business users and technical users, engineers, data scientists, analysts. These data platforms sit at the heart of the company's data and power so many business capabilities, so you need to abstract away all the complexity that goes into the implementation of the platform and make the interface easy for all these personas, whether they are engineers, data scientists, analysts, or business users. We need to balance which capabilities we expose to end users as part of the user experience and which we abstract away as part of the complex implementation.

Second is self-service. These data platforms must let users provision resources, deploy pipelines, create resources, use them, deprovision them again, and access and use data as required, all in a self-service fashion. At enterprise scale you are talking about thousands of human users and applications using a shared, multi-tenant data platform, so self-service capabilities play a critical role.

Third is infrastructure as code. We are in the cloud era now, with many interconnected components that must work together seamlessly as part of a cohesive solution. Platform engineers use infrastructure as code to create resources and deploy and update applications instantly, but these capabilities need to work at scale. You also need sophisticated observability and monitoring capabilities so that you keep your data platforms healthy, maintain the quality of the data on those platforms, and watch how the data is flowing through. If there are concerns like latency, freshness, or schema evolution, you want to detect those problems proactively at runtime rather than reacting later when some application fails in production.

Those are some of the core principles modern platform engineering looks at when building a data platform. Next is developer self-service: how do you create a truly self-service data platform? That needs a fundamental shift in how platform engineers think about the user interface and developer experience. The goal is to enable data teams to work independently while ensuring their actions stay aligned with organizational standards.

First, platform APIs and developer interfaces. You need well-defined platform APIs that provide programmatic access to the platform's capabilities. These APIs could follow REST or similar well-established industry standards, so your application layer stays aligned with the latest and greatest, and you should have clear documentation and versioning strategies so that when the schemas of these APIs, the contracts between data producers, data consumers, and the platform, change, it doesn't become an issue.
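As an illustration of what programmatic access to such a platform could look like, here is a hypothetical sketch using Python's requests library; the base URL, endpoint, and payload fields are invented for the example, not an actual product API.

```python
import requests

# Hypothetical, versioned self-service platform API; URL and schema are illustrative.
PLATFORM_API = "https://data-platform.internal/api/v1"

def provision_dataset(team: str, name: str, retention_days: int) -> dict:
    """Ask the platform to create a governed dataset for a team."""
    payload = {
        "team": team,                  # used later for chargeback attribution
        "dataset_name": name,
        "retention_days": retention_days,
        "classification": "internal",
    }
    resp = requests.post(f"{PLATFORM_API}/datasets", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()                 # e.g. {"dataset_id": "...", "status": "PROVISIONING"}

if __name__ == "__main__":
    print(provision_dataset("payments", "orders_raw", retention_days=365))
```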
Second is standardized data processing. Across an enterprise there are many different applications and personas on a shared, multi-tenant data platform, and every use case needs to read data, pull out the data it requires, do some standardization and processing on it, and then create some business-relevant, meaningful insight from it. If we don't have standardized data processing patterns across these use cases, every use case will do redundant data processing in its own application. Standardizing these patterns is the key to avoiding that redundancy: joins are one example, but it can also be reading some data and flattening it, and things of that nature. You don't want all your applications repeating these steps, so you centralize them, create consistent, standardized data processing patterns everyone can reuse, and let teams build only their use-case-specific applications or implementations on top.

Third is data discovery and catalog services. How are you going to find what data is available out there for you? At enterprise scale you might have thousands of applications and hundreds of systems producing data every day on a shared, multi-tenant platform, with different lines of business using the same data platform. Having a shared catalog of all these datasets, making data easy to find for all these applications and humans, is key to a self-service data platform.

The next big thing is resource management and cost control. As we discussed earlier, if your platform is self-service, users will do whatever they have access to: they will create resources and run queries however they like to get their results as quickly as possible. Resource management becomes key because on the cloud you have practically infinite compute and storage, so managing your costs, especially for processing workloads that consume significant resources, becomes another key dimension when designing modern data platforms.

Now let's look at multi-tenant architecture and resource isolation. In enterprises you have multiple lines of business, and each line of business owns its own systems of record and its own business lifecycle for its end customers. It generates and produces data; it needs to store data, ingest data, and process data. You don't want to build a separate data platform for each line of business to do the same things, and that's where multi-tenancy comes into the picture. You have to build your data platform so that its capabilities can be used across the different lines of business, the tenants in this case. Each tenant has its own workspace, its own access controls, its own isolation in terms of how compute resources are allocated, and its own rules for sharing or not sharing data with other lines of business, subsidiaries, or across country-specific geographical and geopolitical restrictions, and so on. At the same time you need storage and performance isolation: both logical separation and physical isolation, with dedicated storage systems and compute.
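One way to picture per-tenant isolation is a small workspace definition the platform enforces at provisioning and job-submission time. This is an illustrative sketch only; the field names, quotas, and regions are assumptions, not a specific product's model.

```python
from dataclasses import dataclass, field

@dataclass
class TenantWorkspace:
    """Illustrative per-tenant isolation settings enforced by the platform."""
    tenant: str
    storage_prefix: str                  # logical separation: dedicated bucket/path prefix
    max_concurrent_jobs: int             # compute isolation: per-tenant quota
    max_storage_tb: float
    allowed_regions: list[str] = field(default_factory=lambda: ["us-east-1"])
    can_share_with: list[str] = field(default_factory=list)  # explicit cross-tenant sharing

def validate_job_request(ws: TenantWorkspace, running_jobs: int, region: str) -> None:
    """Reject requests that would break the tenant's isolation boundaries."""
    if running_jobs >= ws.max_concurrent_jobs:
        raise RuntimeError(f"{ws.tenant}: concurrent job quota exhausted")
    if region not in ws.allowed_regions:
        raise RuntimeError(f"{ws.tenant}: region {region} not permitted")

retail = TenantWorkspace("retail", "s3://corp-lake/retail/", 20, 500.0)
validate_job_request(retail, running_jobs=3, region="us-east-1")
```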
That gives you a well-optimized, well-managed data platform for a large enterprise, where you can use multi-tenancy to reuse as much of the platform's capability as possible to power all these different businesses.

So what are some of the key cost optimization strategies? Let's first understand the cost drivers. The primary cost usually includes compute resources for data processing: when data is at rest it is stored on the platform, and when you run a query or a request to get it out, compute is used to read the data and return it to you, so compute for processing is one of the key factors. Storage is another key factor, for your raw data and for the processed data. If you're working with data products, or with different layers of data as in a medallion architecture, you have raw data, refined data, and purpose-built, use-case-driven data, and all of it needs to be stored, so at rest you pay a lot of storage cost. Then there is network cost, both at compute time and when data is transferred between systems at runtime.

Platform engineers must implement comprehensive tracking and monitoring and attribute these costs to specific teams, which means good chargeback models: who is using the platform, what are they using it for, and what is the cost of their compute? Chargeback is a well-known and widely followed model across large enterprises. For example, say there is a centralized data lake on the cloud managed by an enterprise data team, and different lines of business use that data lake to ingest their data, store it, consume it, and process it for their use cases. A chargeback model can track all these expenses and assign them back to the right line of business, so they get visibility into how much they are spending, whether they are overspending, whether they are using the platform efficiently, and what they can optimize.

Teams can optimize jobs by looking at when the data actually needs to be processed and scheduling workloads accordingly. They can use autoscaling and downscaling if their workloads and compute demand are predictable. And third, they can use tiered storage: on S3, for example, intelligent tiering moves data to lower-cost, less frequently accessed tiers, which saves a lot of money, and the cloud vendors keep adding capabilities like this. Even without it, if you are aware of your data lifecycle and data retention policies, you can apply the right retention policies and move datasets from one tier to another as part of lifecycle management.
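As one concrete way to apply tiered storage and retention, here is a sketch using boto3 to attach a lifecycle rule to a bucket; the bucket name, prefix, and day thresholds are assumptions for illustration, and credentials are expected to come from the environment.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle policy: age refined-zone objects into cheaper storage
# classes, then expire them per the retention policy. Names and thresholds are examples.
lifecycle = {
    "Rules": [
        {
            "ID": "refined-zone-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "refined/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 730},
        }
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket="corp-data-lake",
    LifecycleConfiguration=lifecycle,
)
```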
Alright, let's take a look at the next slide: automation is key. We talked about infrastructure as code; thinking again about enterprise-scale data platforms, if you're talking about thousands of applications for which you need to provision compute and storage resources, infrastructure automation is required so you can create those resources in an automated way using configuration frameworks, security policies, network rules, and monitoring systems. Pipeline automation is required for all the applications that will use your platform: deploying pipelines onto the platform, testing them, moving them from dev to QA to UAT to production, and monitoring the health of these pipelines in terms of performance and failures. Third is continuous improvement: ongoing refinement of these automation processes based on operational feedback and industry best practices, fed back into the automation. Fourth is configuration management: you have to ensure consistency across environments, because you have multiple environments, dev, QA, UAT, and prod, and all of them should have exactly the same configuration, so your users and applications get the same experience, issues are detected early, and there are no discrepancies in how the platform behaves across environments.

Next, monitoring, observability, and data quality, another key dimension to deep dive on. Observability in data platforms combines traditional infrastructure monitoring with data-specific concerns. There is infrastructure-level and platform-level monitoring: whether the platform is up and running and healthy, how it is performing, how it scales, what the usage, compute consumption, and throughput look like, and things of that nature. And because it is a data platform, there are data-specific concerns: data quality, pipeline performance, schema evolution, errors and failures, rejection rates in data processing, thresholds, and so on. Infrastructure and performance monitoring covers tracking resource utilization, system performance, and service availability across all platform components; the cloud naturally gives you a lot of monitoring out of the box, but if you are building your own capabilities for your data platform, you need to build your own monitoring and alerting for your applications.

For data quality and pipeline monitoring, some of the key things to look at: is any data missing? Is data being lost between pipeline stages? Are you getting duplicate records, or missing records? Are there schema validation violations or statistical anomalies? You can define your own data quality rules and have your pipelines and platform run those rules when processing or ingesting data. Tiered alerting systems, with different escalation procedures for different issue types, are usually the right way to implement monitoring, observability, and data quality at large scale.
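To show what lightweight data quality rules inside a pipeline might look like, here is a minimal sketch over a pandas DataFrame; the column names, thresholds, and sample batch are illustrative assumptions.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality violations for a batch of order records."""
    violations = []

    # Completeness / schema validation: required columns must be present.
    for col in ("order_id", "amount", "event_ts"):
        if col not in df.columns:
            violations.append(f"missing column: {col}")
            return violations

    # Uniqueness: no duplicate business keys.
    if df["order_id"].duplicated().any():
        violations.append("duplicate order_id values")

    # Validity: no null or negative amounts.
    if df["amount"].isna().any() or (df["amount"] < 0).any():
        violations.append("null or negative amounts")

    # Freshness: newest record should be recent (the 6-hour threshold is illustrative).
    lag = pd.Timestamp.now(tz="UTC") - pd.to_datetime(df["event_ts"], utc=True).max()
    if lag > pd.Timedelta(hours=6):
        violations.append(f"stale data: newest record is {lag} old")

    return violations

batch = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [120.0, -5.0, 80.0],
    "event_ts": ["2025-01-01T00:00:00Z"] * 3,
})
print(run_quality_checks(batch))  # a tiered alerting system could route these by severity
```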
Next, security and compliance. It's not just regulated industries like finance: almost all industries and enterprises in today's cloud world face serious security, cybersecurity, and compliance concerns, and especially so in regulated industries. So security and compliance is critical for every enterprise, regulated or not, and it becomes a key requirement and dimension while designing your modern data platforms.

Data encryption and protection: because we are talking about data platforms, how are you going to keep your data secure? You can implement multi-level encryption for data at rest, in transit, and in memory, along with data masking and tokenization. How you protect sensitive information becomes a key capability. You can have different platform types: platforms that are allowed to hold customer-sensitive information with tight access controls, and platforms where you tokenize or anonymize all your data before storing it, so those platforms hold no customer-sensitive information at all.

Access controls and authentication become key. You have to make sure you have sophisticated access controls at multiple levels of granularity: role-based access control, attribute-based access control, and both coarse-grained and fine-grained access control at the file level, table level, row or record level, and column level. All of these become key implementation techniques for the right access controls and authentication mechanisms on shared data platforms.

Then compliance and audit capabilities: comprehensive audit logging, data governance, and automated compliance. Effective systems capture detailed access and modification logs while providing query capabilities for reporting and investigation. If you're using something like Delta Lake or Snowflake, with a table format and SQL-based connections for your consumers, you are generally fine, because SQL-based consumption tracks what data is being used, who is using it, and which attributes are being accessed. But for non-SQL patterns, file-based consumption for example, you may need to build extra capabilities to identify and track who is using which data, which file, which attribute, and what they are doing with it. Platform engineers must design security capabilities that provide comprehensive protection while maintaining performance and usability: you don't want security controls that degrade the performance and usability of the platform or prevent users from actually using the data. That is a key consideration for platform engineers.
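For those non-SQL, file-based consumption paths, here is an illustrative sketch of a read helper that enforces a per-role column allow-list and writes an audit record; the roles, columns, dataset path, and log destination are all assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone

import pandas as pd

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data-platform.audit")

# Illustrative column-level allow-lists per role; attribute-based rules, row filters,
# or tokenized columns could be layered onto the same hook.
COLUMN_POLICY = {
    "analyst": ["order_id", "amount", "status"],
    "data_scientist": ["order_id", "amount", "status", "customer_id"],
}

def read_dataset(path: str, user: str, role: str) -> pd.DataFrame:
    """Read a Parquet file, project to the columns the role may see, and audit the access."""
    allowed = COLUMN_POLICY.get(role)
    if allowed is None:
        raise PermissionError(f"role {role!r} has no access policy")

    df = pd.read_parquet(path, columns=allowed)

    audit_log.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "dataset": path,
        "columns": allowed,
        "rows_returned": len(df),
    }))
    return df
```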
Next, how do you future-proof your data platform architecture? As we saw, data technologies and capabilities are emerging rapidly, so while designing a new platform you have to think about how to future-proof it. First, design architectures that can accommodate new technologies without vendor lock-in. A lot of cloud managed services tend toward lock-in: you have to load your data into their systems, Redshift for example, or, historically, Snowflake. So you want to implement abstraction layers and keep your data platform on open standards as much as you can, with an abstraction on top, so your users and consumers are not impacted even if you change the underlying storage or compute layer from one thing to another in the future. Managing your vendors correctly, and keeping your architecture on open formats as much as possible, is key to staying future-proof.

Second is scalability and performance planning. As you create multi-tenant shared platforms, you need an architecture that can scale for growing data volumes, from terabytes to petabytes and beyond. Data only keeps growing as businesses enable data-powered AI capabilities and platforms and work to understand their customers' behavior; it is not going to shrink. So you need capabilities that can sustain growing data volumes, increasing usage, and growing data ingestion, processing, and transformation: caching strategies and multi-tiered storage strategies built into your platform architecture, so that when your data and usage grow, the platform scales as required. Third, you need flexible architectures that can accommodate business growth, changing analytical environments, and evolving governance needs. If your business adds capabilities in the future that require new governance or compliance, your platform should be able to enable them easily.

Successful data platform engineering requires balancing multiple competing priorities, which cannot be achieved through technology alone but requires careful consideration: flexibility, consistency, self-service, governance, cost optimization, performance. We covered all of these in the previous slides, and they are the key dimensions for platform engineers and architects designing data platforms in modern data infrastructure.

On the future of data platform engineering: as data technology continues to evolve, platform engineers must remain adaptable while maintaining focus on core principles. Four core principles we covered: one, abstraction, how you abstract the complex implementations and enable multi-tenant platforms for your users and applications; two, automation, how you automate configuration management, deployment, monitoring, and predictive analysis of issues in the pipelines; three, observability, which includes monitoring, alerting, data quality, and both data-level and infrastructure-level observability; and four, last but not least, user experience, how you make your data easy to find for your consumers through the right governance and the right user experience, and how it is made available for both human users and application users across different lines of business. These principles provide a stable foundation for navigating technology change while building platforms that can grow with the organization's needs.
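Tying the abstraction principle back to the earlier point about avoiding lock-in, here is an illustrative sketch of a thin storage interface that keeps producers and consumers insulated from the underlying engine; the class and method names are invented for the example, and Parquet stands in for any open format.

```python
from pathlib import Path
from typing import Protocol

import pandas as pd

class TableStore(Protocol):
    """Narrow interface the platform exposes; backends can change underneath it."""
    def write(self, table: str, df: pd.DataFrame) -> None: ...
    def read(self, table: str) -> pd.DataFrame: ...

class ParquetStore:
    """Open-format backend: plain Parquet files on local disk or object storage."""
    def __init__(self, root: str) -> None:
        self.root = root
        Path(root).mkdir(parents=True, exist_ok=True)

    def write(self, table: str, df: pd.DataFrame) -> None:
        df.to_parquet(f"{self.root}/{table}.parquet", index=False)

    def read(self, table: str) -> pd.DataFrame:
        return pd.read_parquet(f"{self.root}/{table}.parquet")

def publish_orders(store: TableStore) -> None:
    # Producers code against TableStore, so swapping this backend for another
    # engine later does not ripple into every pipeline.
    df = pd.DataFrame({"order_id": [1, 2], "amount": [120.0, 80.0]})
    store.write("orders", df)

publish_orders(ParquetStore("/tmp/lake"))
```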
Some of the emerging areas a lot of companies are exploring, so their data platforms can support new capabilities: one is real-time processing, enabling stream processing and real-time analytics. Second is ML integration: how do you provide seamless integration of your machine learning and analytics models with your data platforms? If your data platforms are isolated, your ML workloads cannot use your data as seamlessly as they could. And third, how could you build cloud-native, fully distributed, containerized architectures to gain maximum benefit from your cloud architecture? These are some of the key areas platform engineers need to consider while designing new data platforms.

So, the platform engineering approach to data infrastructure offers a path to manage the complexity of these data platforms while enabling innovation and agility. We looked at three core pillars, abstraction, automation, and user experience, and at two key balances, flexibility with governance and self-service with control, with the ultimate goal of enabling transformative business capabilities. By applying these principles thoughtfully and consistently, organizations can build infrastructure that serves as a foundation for data-driven decision making and business growth powered by AI and ML. Thank you so much for watching my session. I hope this was helpful. Bye.

Rahul Joshi

Software Engineer @ Meta

Rahul Joshi's LinkedIn account


