Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Building Self-Service Data Platforms: Engineering Scalable ETL Infrastructure for Developer Experience


Abstract

Transform chaotic ETL ops into self-service magic! Learn how 19+ years of enterprise experience turned complex data pipelines into developer-friendly platforms. Real stories from Fidelity, Southwest Airlines & more. Your devs will thank you!


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, good morning and good evening, everyone. I'm Batchu Krishna, and I have about 20 years of experience in IT. Most of my experience is in data migrations and data handling, and I have worked across different domains such as financial services, automotive, energy, and healthcare. I have gained a lot of good experience on data migration and data handling projects, working with heterogeneous data and very large data volumes, and I have also worked on data governance, controls, and support. The ETL tools I have worked with include DataStage, Informatica, SSIS, and Talend, and I have good knowledge of how those tools generate and process data as per the business requirements. I have also managed fairly big teams, around 12 people onshore plus an offshore group in an onshore-offshore model. Most of my work has been on the data handling side of ETL projects, especially migration and banking projects for clients like JP Morgan and US Bank, including large-scale platform migrations. Coming to the investment side, I am currently working with Fidelity Investments on a cloud-based migration project dealing with 401(k) plans. In my career I have used different databases such as SQL Server, Teradata, and DB2, and cloud-based platforms like Databricks, Snowflake, and Azure SQL. I keep upgrading myself with new technologies to meet new project requirements. On the automation side, I use PySpark to ensure good data quality, and on my current project I also use a scripted automation tool that handles validations, captures screenshots, and checks logs.
Here I am going to give a presentation on building self-service data platforms and engineering scalable ETL infrastructure for developer experience. Coming to the outline: the evolution of platform engineering has fundamentally transformed how organizations approach data infrastructure. Day by day there are new challenges, and we need to incorporate those challenges and come up with solutions, which requires good infrastructure. Traditional ETL operations are being reimagined as self-service platforms that empower developers while maintaining enterprise-grade reliability and governance. At the same time, we need to treat the traditional ETL operations as the foundation on which we build the new infrastructure and new ideas, transforming from traditional ETL to cloud-based, self-service data platforms.
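As a rough illustration of the kind of PySpark-based data quality automation described above, here is a minimal sketch; the input path, column names, and checks are illustrative assumptions rather than the actual project code.

```python
# Minimal PySpark data-quality sketch: row count, null check, and duplicate
# check with a simple pass/fail summary. Path and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-validation").getOrCreate()

# Hypothetical landing-zone dataset; swap in the real source in practice.
df = spark.read.parquet("/data/landing/orders/")

checks = {
    "row_count": df.count(),
    "null_order_ids": df.filter(F.col("order_id").isNull()).count(),
    "duplicate_order_ids": df.count() - df.dropDuplicates(["order_id"]).count(),
}

# Fail fast on critical violations; otherwise log the summary for the run.
if checks["null_order_ids"] > 0 or checks["duplicate_order_ids"] > 0:
    raise ValueError(f"Data quality checks failed: {checks}")
print(f"Data quality checks passed: {checks}")
```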
This slide is about the platform engineering pillars: product-oriented thinking, developer integration, and balancing capability and complexity. The first point is product-oriented thinking: platform engineering represents a shift from traditional infrastructure management to product-oriented thinking about internal capabilities. The second one is developer integration: the most effective data platforms feel natural to application developers, integrating with familiar CI/CD processes and established patterns, meaning continuous integration and continuous deployment. The process is that whenever development is done, the code is built and pushed into the CI/CD pipelines, where it goes through a pull request, the change is merged and deployed, and then the process continues seamlessly. The other point is balancing capability and complexity: modern platforms must be sophisticated enough for enterprise-scale processing while remaining simple enough for general-purpose developers. A modern data platform handles huge volumes of data, more transactions, and more legacy data, so the self-service infrastructure should be sophisticated enough to handle those situations while still staying simple for general-purpose developers. This transformation from siloed data engineering to platform-driven approaches reflects broader industry trends toward DevOps integration and self-service capabilities, eliminating the traditional handoffs between teams. Today we often depend on handoffs between teams, like the steps of the lifecycle where one step finishes before the second starts (with some work also done in parallel), but by transforming to these newer technologies for data handling, those dependencies between teams can be eliminated.
Next, the architectural foundations for ETL (extract, transform, load). Building an effective self-service data platform requires an architecture that accommodates both current requirements and future growth; it should be good enough to handle migrating legacy data and the current transactions as well as future requirements. The foundation typically centers on a few points: a microservices architecture that decomposes data processing into discrete, composable components; a container orchestration platform like Kubernetes providing the runtime foundation; and carefully selected data processing engines and schedulers like Apache Airflow. On top of that is security: make sure data usage is controlled from a security point of view and that the right permissions are given to the right people. The main challenge lies in abstracting Kubernetes complexity from end users while preserving access to its powerful scheduling and resource management capabilities, as the sketch below illustrates.
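To make the "abstract Kubernetes from end users" idea concrete, here is a minimal sketch of a platform facade built on the official Kubernetes Python client; the namespace, image, and labels are assumptions for illustration, not the platform described in the talk.

```python
# Sketch of a thin platform facade: developers call one function, the platform
# owns the Kubernetes Job spec, scheduling, and retries. Names are illustrative.
from kubernetes import client, config


def run_etl_job(name: str, image: str, args: list[str],
                namespace: str = "data-platform"):
    """Submit a containerized ETL step as a Kubernetes Job without exposing
    the full Job spec to the calling developer."""
    config.load_kube_config()  # use load_incluster_config() inside the cluster

    container = client.V1Container(name=name, image=image, args=args)
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "self-service-etl"}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name, namespace=namespace),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )
    return client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


# Developer-facing one-liner (hypothetical image and arguments):
# run_etl_job("orders-load", "registry.example.com/etl/orders:1.0", ["--date", "2025-01-01"])
```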
That is what the current industry is facing: abstracting Kubernetes complexity from the end-user side while preserving access to the powerful scheduling and resource management capabilities. If you give access to the wrong person, someone who is not on the team or not part of the project, issues will come up, and at the same time, if you give broad access to end users there can be tremendous data leakage if it goes into the wrong hands. That falls under resource management capabilities as well. This slide talks about metadata-driven pipeline architecture. Before going into the slide, I would like to give some insight into how the pipelines were designed in my previous projects. Data from the different sources is pulled into a landing zone, then from landing into an integration zone, and from integration to the target. For each hop, from the sources to raw, raw to landing, landing to integration, and integration to curated, there is a separate pipeline, and they are all integrated at each level. From landing to integration some ETL transformations are performed, and from integration to curated a different level of transformations is performed. Coming back to this slide: the concept of metadata-driven ETL represents a fundamental shift from imperative programming models to declarative approaches that separate business logic from implementation details. That is what I just mentioned: traditionally we had to load the data manually and validate and place the data manually; now the business logic is separated from the implementation details through metadata, runtime interpretation, and optimization. Structured metadata means that pipelines are defined through structured metadata describing the transformations, data quality requirements, and operational parameters. As data moves from raw to integration, we need to verify that the data quality is coming through as per the requirements; these kinds of checks need to be performed. For operational parameters we also need proper settings: if real-time data is coming in, how to handle those transactions, how to handle bulk data, and the design should be such that if a surge in load comes, new transformation approaches can handle it. Runtime interpretation means the platform interprets the metadata at runtime to generate and execute the appropriate processing workflows (a minimal sketch of this idea appears at the end of this section). As I mentioned, there are different pipelines, and we have an Airflow DAG that we can trigger manually, or these workflows can run on schedule. Once all development and QA are done, we make sure the move to production happens appropriately.
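Here is a minimal sketch of the metadata-driven idea above: the pipeline is declared as data (zones, transform, quality rules) and a small runtime interprets it. The zone names, transform registry, and sample rows are illustrative assumptions.

```python
# A pipeline declared as structured metadata, interpreted at runtime.
# Everything here (names, rules, sample data) is illustrative.
from typing import Callable

PIPELINE_METADATA = {
    "name": "orders_raw_to_curated",
    "zones": ["raw", "landing", "integration", "curated"],
    "transform": "standardize_orders",
    "quality_rules": {"not_null": ["order_id"], "min_rows": 1},
}

# Business logic is looked up by name, keeping implementation details out of
# the pipeline definition itself.
TRANSFORM_REGISTRY: dict[str, Callable[[list[dict]], list[dict]]] = {
    "standardize_orders": lambda rows: [
        {**row, "order_id": str(row["order_id"]).strip()} for row in rows
    ],
}


def run_pipeline(metadata: dict, rows: list[dict]) -> list[dict]:
    """Interpret the declarative metadata: apply the named transform, then
    enforce the declared quality rules as data moves between zones."""
    transform = TRANSFORM_REGISTRY[metadata["transform"]]
    rows = transform(rows)

    rules = metadata["quality_rules"]
    if len(rows) < rules["min_rows"]:
        raise ValueError("quality check failed: too few rows")
    for column in rules["not_null"]:
        if any(row.get(column) is None for row in rows):
            raise ValueError(f"quality check failed: nulls in {column}")
    return rows


print(run_pipeline(PIPELINE_METADATA, [{"order_id": " 42 ", "amount": 10.5}]))
```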
Those workflows run continuously without any issues. There are also opportunities for parallelization, coaching, and resource optimization. If 20 flows are running in parallel, there should not be any breakage, blocking, or performance issues. For coaching, whenever newcomers join we need to make sure the documentation is correctly maintained at each and every step, so that when a new change comes, we can pick it up from wherever the last step ended; we maintain that documentation, and when new people join the team we coach and mentor them. For resource optimization, we make sure that the people working on the project have the correct access and that the right combination of people holds that access, and we also make sure the current configuration of the systems and infrastructure we are using is right; all of that is part of implementing the architecture and the transformation. This approach makes pipelines more maintainable and less prone to implementation errors, since they focus on business requirements rather than technical details. If you maintain all these steps, structured metadata and runtime interpretation, there are far fewer maintenance problems and implementation errors, and the focus stays on the business requirements rather than the technical details.
Next is integration with CI/CD and DevOps workflows. Most of the time nowadays, whatever development is done goes straight through CI/CD integration tooling with DevOps support. Once a developer creates a data pipeline, deploying it follows patterns similar to those established for application deployment, including version control, automated testing, and staged deployments with rollback capability. Using continuous integration and deployment we get version control, from the previous version to the current and latest versions. For automated testing, hardly any manual testing happens nowadays: once the code is deployed, the existing test scenarios and test code are already there, the prerequisite test data is inserted, and automated testing runs against those scenarios with much less manual effort; a sketch of the kind of automated check a CI pipeline can run appears just below. With staged deployments, each deployment waits for approvals and the deployments happen one by one, stage by stage. Rollback capability means that if something goes wrong in the integration part, we can roll back the latest version and keep running the older version to make sure the business is not impacted. Git-based workflows have become the de facto standard for managing infrastructure and application code, and the platform must extend these patterns to pipeline definitions and related artifacts. Continuous integration pipelines for data processing present unique challenges compared to traditional application CI/CD, including extended execution timelines.
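As an example of the automated testing step mentioned above, here is a small pytest sketch that a CI/CD pipeline could run before a pipeline definition is promoted; the required fields and zone conventions are assumptions carried over from the metadata sketch earlier, not an actual test suite from the talk.

```python
# Pytest sketch: validate pipeline metadata in CI before deployment so broken
# definitions never reach staging. Field names and rules are illustrative.
import pytest

REQUIRED_FIELDS = {"name", "zones", "transform", "quality_rules"}


def validate_pipeline_metadata(metadata: dict) -> None:
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError(f"pipeline metadata missing fields: {sorted(missing)}")
    if metadata["zones"][0] != "raw" or metadata["zones"][-1] != "curated":
        raise ValueError("pipelines must start in the raw zone and end in curated")


def test_valid_metadata_passes():
    validate_pipeline_metadata({
        "name": "orders_raw_to_curated",
        "zones": ["raw", "landing", "integration", "curated"],
        "transform": "standardize_orders",
        "quality_rules": {"not_null": ["order_id"]},
    })


def test_missing_fields_are_rejected():
    with pytest.raises(ValueError):
        validate_pipeline_metadata({"name": "broken"})
```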
The next topic is cloud-native API integration patterns, which is also something the data platform needs to serve. Modern data platforms must integrate seamlessly with cloud-native services while providing consistent abstractions that prevent vendor lock-in. There are three topics I would like to cover here: Azure Data Factory integration, Databricks integration, and API design principles. First, Azure Data Factory integration: the platform should simplify Data Factory pipeline creation while preserving access to advanced features when needed, typically through template-based approaches. Databricks integration requires careful consideration of cluster management, notebook deployment, and job orchestration patterns; we should be careful about which cluster we are using, which job orchestration and which notebooks we are using, and when cloning those configurations, make sure the cloning is done correctly. Coming to API design principles, RESTful interfaces with clear resource models and consistent error handling provide a predictable experience, and GraphQL implementations can offer more flexible query capabilities. With API design there will be interfaces with a clear resource model and consistent error handling; when we use API services, errors come up often, and since this is cloud-native integration we need to make sure that whatever configurations and requests we send are clearly and correctly defined so that they do not produce errors.
Next, Apache Airflow as the orchestration foundation. Apache Airflow serves as the orchestration backbone for many modern data platforms, owing to its Python-native approach and extensive ecosystem integrations. Airflow can be integrated with the latest tools and environments, and since Python is such a widely used language, Airflow's Python-native approach makes it easy to use. The key implementation concerns include multi-tenancy and resource isolation, DAG generation and deployment patterns, custom operator development, and monitoring integration. For DAG generation and deployment patterns, based on the development we need to identify how many DAGs are required and schedule them in the correct order. For monitoring and alerting integration, once a job is triggered, monitoring is in place; if there is a failure there will be logs, and while it is running we can check the logs to see what the issue is, what has completed, and what stage it is at. Rather than requiring users to write Python DAG definitions directly, successful platforms typically provide high-level abstractions that generate Airflow DAGs from metadata definitions, so from the metadata definition itself the Airflow DAGs are generated, as the sketch below shows.
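To illustrate the "generate Airflow DAGs from metadata" point, here is a minimal sketch using Airflow 2.x-style imports; the pipeline list, schedule, and zone-hop tasks are illustrative assumptions, and a real platform would load the metadata from a store rather than a hard-coded list.

```python
# Sketch: each metadata entry becomes an Airflow DAG with one task per zone
# hop. Names, schedule, and the print-only task body are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

PIPELINES = [
    {"name": "orders_daily", "schedule": "@daily",
     "zones": ["raw", "landing", "integration", "curated"]},
]


def move_data(source_zone: str, target_zone: str, **_):
    # Placeholder for the real zone-to-zone load/transform logic.
    print(f"moving data: {source_zone} -> {target_zone}")


for meta in PIPELINES:
    with DAG(
        dag_id=meta["name"],
        schedule=meta["schedule"],
        start_date=datetime(2025, 1, 1),
        catchup=False,
    ) as dag:
        previous = None
        for src, dst in zip(meta["zones"], meta["zones"][1:]):
            task = PythonOperator(
                task_id=f"{src}_to_{dst}",
                python_callable=move_data,
                op_kwargs={"source_zone": src, "target_zone": dst},
            )
            if previous is not None:
                previous >> task  # chain the zone hops in order
            previous = task
    # Expose the generated DAG at module level so the scheduler discovers it.
    globals()[meta["name"]] = dag
```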
We can use those patterns in our development and schedule them. Next is infrastructure as code for ETL development: infrastructure-as-code principles transform ETL development from manual, error-prone processes into automated, repeatable procedures that integrate naturally with software development workflows. There are three points I would like to discuss here: Terraform modules, Helm charts, and GitOps workflows. Terraform modules encapsulate common infrastructure patterns for data processing workflows, enabling teams to deploy complex multi-service environments through simple configuration declarations. Helm charts provide a Kubernetes-native approach to infrastructure as code that aligns with containerized data processing workloads and their deployments. With GitOps workflows, infrastructure changes are triggered by commits to Git repositories, creating audit trails and enabling sophisticated approval workflows through tools like Argo CD and Flux.
Next is testing frameworks for data quality, and data quality assurance is one of the critical areas. Data quality testing represents a fundamental requirement for enterprise data platforms, yet traditional testing approaches often prove inadequate for modern data processing workflows. Comprehensive testing strategies include unit testing with synthetic data generation, integration testing in container-based environments, schema validation testing for backward compatibility, and performance regression testing with baseline comparisons; a small sketch of a schema compatibility check appears at the end of this section. By using these four levels of testing we make sure that whatever data is handled and processed, the data quality is right. Unlike traditional software testing, where mock objects can simulate external dependencies, data processing tests often require representative data sets that capture the complexity and edge cases present in production data. Production data is the final data, so it should not have any issues or errors; by doing good quality checks we can avoid errors in the production data.
Then monitoring and observability standards. There are four ways we do monitoring: distributed tracing, metrics collection, log aggregation, and intelligent alerting. Comprehensive monitoring and observability capabilities are essential for maintaining a reliable data platform operating at enterprise scale. Distributed tracing is critical for understanding complex data processing workflows that span multiple services and processing stages; systems like Jaeger and Zipkin provide detailed execution visibility. For metrics collection, time-series databases like InfluxDB provide efficient storage and querying capabilities for both technical performance indicators and business-relevant process outcomes. Log aggregation tools like Elasticsearch and Splunk handle high volumes of log output with sophisticated search and analysis capabilities for rapid problem identification; by aggregating logs this way we can very quickly find where exactly an issue is happening. Intelligent alerting adds sophisticated alert correlation and escalation policies, like ML-based anomaly detection, reducing the number of issues that need manual attention.
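As a small example of the schema validation testing for backward compatibility mentioned above, here is a sketch that compares a candidate schema against a baseline; the column names and types are illustrative.

```python
# Backward-compatibility sketch: a new schema may add columns, but must not
# drop or retype existing ones. Schemas below are illustrative.
BASELINE_SCHEMA = {"order_id": "string", "amount": "double", "order_date": "date"}
CANDIDATE_SCHEMA = {"order_id": "string", "amount": "double",
                    "order_date": "date", "channel": "string"}


def check_backward_compatible(baseline: dict, candidate: dict) -> list[str]:
    """Return a list of compatibility violations; empty means compatible."""
    violations = []
    for column, dtype in baseline.items():
        if column not in candidate:
            violations.append(f"column removed: {column}")
        elif candidate[column] != dtype:
            violations.append(f"type changed: {column} {dtype} -> {candidate[column]}")
    return violations


issues = check_backward_compatible(BASELINE_SCHEMA, CANDIDATE_SCHEMA)
if issues:
    raise SystemExit("schema is not backward compatible: " + "; ".join(issues))
print("schema change is backward compatible")
```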
Next is real-world implementation experience. Practical implementation of self-service data platforms across diverse enterprise environments provides valuable insights into both architectural patterns and organizational change management strategies. The industry-specific challenges include: financial services, with compliance, data governance, and risk management requirements; healthcare, with HIPAA compliance, patient privacy, and access controls; banking, with fraud detection, real-time processing, and core system integrations; and airlines, with operational resilience and disaster recovery capabilities.
Then there is developer experience and reducing the load on developers. The success of self-service data platforms ultimately depends on their ability to reduce the load on development teams while providing powerful data processing capabilities. For CLI tool design, well-designed CLI tools follow established conventions for parameter handling, output formatting, and error reporting, with comprehensive help systems; using such CLI tools we can handle output formatting, error reporting, and all of these things (a minimal CLI sketch appears at the end of this section). Web interfaces must provide visualization of complex data processing workflows while enabling sophisticated configuration and monitoring capabilities. For documentation, interactive documentation systems that combine explanatory content with executable examples provide an effective learning experience.
Next is governance and compliance in the platform architecture: data lineage, audit logging, access control, and privacy protection. As I mentioned, data governance and compliance are very key to the company; frankly speaking, the stock market and everything else depend on that compliance, so there must be no data leaks, and all standards must be maintained so that the business can keep running and accountability is there. Data lineage automatically captures data flow information as pipelines execute, maintaining detailed records of source transformations and destinations for compliance reporting. Access controls integrate with enterprise identity systems while providing fine-grained permissions through RBAC for dynamic authorization. Audit logging captures comprehensive records of all user actions, data access, and configuration changes, with sufficient detail for compliance reporting. Privacy protection addresses regulatory requirements like GDPR and CCPA through automated data discovery, classification, anonymization, and deletion capabilities. So data governance and compliance are key for the company, supported by these four pieces of the architecture: data lineage, access control, audit logging, and privacy protection.
Finally, future directions and the conclusion of this transformation journey. The emerging patterns are serverless computing for more efficient resource utilization, machine learning integration with model lifecycle management, real-time processing capabilities with stream processing frameworks, edge computing scenarios spanning cloud and edge environments, and AI-powered platform operations for optimization and monitoring.
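Returning to the CLI tool design point above, here is a minimal sketch of a platform command-line tool with consistent parameter handling, output formatting, and error reporting; the `dataplat` name, the `run` subcommand, and the trigger_pipeline() placeholder are all hypothetical.

```python
# Sketch of a small platform CLI: consistent flags, JSON output, and clear
# error reporting. The subcommand and the API call are placeholders.
import argparse
import json
import sys


def trigger_pipeline(name: str, run_date: str) -> dict:
    # Placeholder for a call to the platform's orchestration API.
    return {"pipeline": name, "run_date": run_date, "status": "triggered"}


def main() -> int:
    parser = argparse.ArgumentParser(
        prog="dataplat", description="Self-service data platform CLI (sketch)")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="trigger a pipeline run")
    run.add_argument("pipeline", help="pipeline name, e.g. orders_daily")
    run.add_argument("--date", required=True, help="logical run date, YYYY-MM-DD")

    args = parser.parse_args()
    try:
        result = trigger_pipeline(args.pipeline, args.date)
    except Exception as exc:  # uniform error reporting on stderr
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(json.dumps(result, indent=2))  # uniform, machine-readable output
    return 0


if __name__ == "__main__":
    sys.exit(main())
```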
These are the emerging patterns and new technologies for data handling and for transforming from the legacy approach to the new one. Coming to the keys to success: building an effective self-service data platform requires a holistic approach that balances technical sophistication with user experience considerations. The most successful implementations treat platform development as product development, with clear user personas, iterative implementation cycles, and comprehensive feedback mechanisms. Organizations that successfully implement comprehensive data platform strategies gain a significant competitive advantage through improved developer productivity, faster time to market, and more reliable operations. These are the keys to success. Thank you, and thank you for giving me the opportunity to present myself and my work and to share my thoughts; I'm happy to share my experience and this presentation here. Thank you very much.
...

Batchu Krishna

Software Engineer @ Fidelity Investments

Batchu Krishna's LinkedIn account


