Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi.
Good morning everyone.
Good evening.
I'm Christian Chu.
I have 20 years of experience in IT, and most of my experience is in data migrations and data handling.
I have plenty of expertise in different domains like financial, automobile, energy, and healthcare, and I have worked on different platforms.
I have gained a lot of good experience on data migration and data handling projects, handling heterogeneous data and very large volumes of data, and I have worked with data governance, IT support, and controls.
On the ETL side, I have worked on tools like DataStage, Informatica, SSIS, and Talend; whichever tools a project needed, even if new to me, I picked up and used.
I have good knowledge of those tools and of the data they generate, and I work with that data as per the business requirements.
I have also handled big teams; I used to manage a team of around 12 people onshore, plus people offshore, in an onshore-offshore model.
Most of my work has been on the data handling side, on ETL projects, and especially on migration and banking projects for clients like JP Morgan and US Bank.
Most of those were migration projects, like MSP to BS migrations, and similar areas.
Coming to the investment side, I am currently working with Fidelity Investments on a cloud-based migration project that I'm handling right now.
It deals with 401(k) plans and related products.
Okay.
And in my career I have used different relational databases like SQL Server, Teradata, and DB2, and cloud-based databases like Databricks, Snowflake, and Azure SQL.
So these are the areas and the different tools I have worked on.
With most of the tools, I keep upgrading myself with new technologies to meet the new requirements of the projects.
And besides that, on the automation side, I use PySpark to get good quality data, using PySpark automation tooling, and on the current project I am also using JC.
With those two we are doing automation, where we can do the validations.
There are a lot of validations and checks out there, so capturing screenshots, validating the logs, and everything is done by the JC automation tool, meaning automation scripts.
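As a rough illustration of the kind of PySpark data-quality automation described above, here is a minimal sketch; the table path, columns, and checks are hypothetical, not taken from the projects mentioned.

```python
# Minimal PySpark data-quality sketch (hypothetical path and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.parquet("/landing/transactions")  # hypothetical landing-zone path

checks = {
    # each check counts offending rows; zero means the check passes
    "null_account_id": df.filter(F.col("account_id").isNull()).count(),
    "negative_amount": df.filter(F.col("amount") < 0).count(),
    "duplicate_txn_id": df.count() - df.dropDuplicates(["txn_id"]).count(),
}

failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # the real automation would also capture screenshots and logs here
    raise ValueError(f"Data quality checks failed: {failed}")
print("All data quality checks passed")
```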
Yeah.
And besides that, as you mentioned, I have worked a lot on the data and cloud side.
So here I'm going to give a presentation on building self-service data platforms and engineering scalable infrastructure for developers in our companies.
Coming to the outline of this talk: the evolution of platform engineering has fundamentally transformed how organizations approach data infrastructure.
So, step by step, how is the infrastructure developing in terms of data approaches?
Day by day there are new challenges we have to face, and we need to incorporate those new challenges and come up with solutions; to do that, we need good infrastructure.
Traditional ETL operations are being reimagined as self-service platforms that empower developers while maintaining enterprise-grade reliability and governance.
Yeah.
And at the same time, we need to maintain the traditional ETL operations as well; we need to take them as the foundation, and on top of that we need to build new infrastructure and new ideas, and transform from traditional ETL to new cloud-based ETL, or self-service data platforms.
Yeah, this slide tells you about the platform engineering pillars: product-oriented thinking, developer integration, and balancing capability and complexity.
The first point is product-oriented thinking: platform engineering represents a shift from traditional infrastructure management to product-oriented thinking about internal capabilities.
And the second one, developer integration: the most effective data platforms feel natural to application developers, integrating with familiar CI/CD processes and established patterns, that is, continuous integration and continuous deployment.
The third process is: whenever development is done, with a code build and code deployment, they push the code into the CI/CD pipelines, where it goes through a pull request, the code gets applied, and then the process proceeds seamlessly.
Coming to the other point, balancing complexity: a modern platform must be sophisticated enough for enterprise-scale processing while remaining simple enough for general-purpose developers.
And see, a modern data platform is handling huge volumes of data coming in, with more transactions and more legacy data, so the self-service infrastructure should be sophisticated enough to handle those situations, while remaining simple enough for general-purpose developers.
This transformation from siloed data engineering to platform-driven approaches reflects broader industry trends towards DevOps integration and self-service capabilities, eliminating traditional handoffs between teams.
So once we do the automation of data movement, with CI/CD and the other pieces, we reduce the dependencies between teams; as of now we depend on each team, with handoffs between the teams across the lifecycle, and once one step is done we go to the second step, though people also work in parallel.
But going forward, by using this transformation to the latest technologies for data handling, those kinds of dependencies between teams we are going to eliminate.
So, the architectural foundations for ETL: what could be the architectural foundations for extract, transform, load?
Okay.
Building effective self-service data platforms requires an architecture that accommodates both current requirements and future growth.
The foundation typically centers on the points below.
First, an effective self-service data platform should handle the current requirements as well as the future ones: whether migrating legacy data or handling the current transactions, it should be good enough to handle the future requirements as well.
The foundation is typically a microservices architecture that decomposes data processing into discrete, composable components; a container orchestration platform like Kubernetes providing the runtime foundation; and carefully selected processing and orchestration engines like Apache Airflow, which control scheduling.
On security, make sure that from a data usage and security point of view we are at a secure level; we also need to provide the right permissions to customers and people.
The challenge lies in abstracting Kubernetes complexity from end users while preserving access to its powerful scheduling and resource management capabilities, sir.
That's the main challenge the industry is facing now: abstracting Kubernetes complexity from the end-user side while preserving access to the powerful scheduling and resource management capabilities.
Because if you give access to the wrong person, someone who is not on the team or not part of the project, issues will come; and at the same time, if you give full access to an end user, there is going to be tremendous data leakage if it goes into the wrong hands.
So there will be issues, and this also comes under the resource management capabilities.
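As a minimal sketch of what abstracting Kubernetes complexity can look like in practice, the snippet below turns a small, developer-friendly job description into a full Kubernetes Job manifest; the field names and defaults are illustrative assumptions, not the platform described in the talk.

```python
# Hypothetical abstraction: developers describe a job in simple terms,
# and the platform renders the full Kubernetes Job manifest for them.
from dataclasses import dataclass

@dataclass
class SimpleJob:
    name: str
    image: str
    command: list
    cpu: str = "500m"      # sensible defaults hide resource tuning from users
    memory: str = "1Gi"

def to_k8s_job(job: SimpleJob) -> dict:
    """Render a Kubernetes batch/v1 Job manifest from the simple spec."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job.name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": job.name,
                        "image": job.image,
                        "command": job.command,
                        "resources": {
                            "requests": {"cpu": job.cpu, "memory": job.memory},
                        },
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }

manifest = to_k8s_job(SimpleJob("daily-load", "etl-runner:1.0", ["python", "load.py"]))
print(manifest["spec"]["template"]["spec"]["containers"][0]["resources"])
```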
And here, this slide talks about metadata-driven pipeline architecture.
Before going into the slide, I would like to give you some insight into this: how is the pipeline designed?
In my previous projects, the Azure pipelines were designed like this: we had data from different sources, it would be pulled and put into the landing zone, then from landing to the integration zone, and from integration to the target.
So from the sources they have one pipeline, from raw to the landing zone they have one, and from landing to integration to curated there are different pipelines.
These are all integrated at each level.
From landing to integration, some ETL transformations will be performed, and from integration to curated, a different level of transformations will be performed.
Okay, coming back to this slide: metadata-driven architecture.
The concept of metadata-driven ETL represents a foundational shift from imperative programming models to declarative approaches that separate business logic from implementation details.
That's what I just mentioned: traditionally, we had to load the data manually, verify it manually, and place the data manually, and then process it based on the implementation details; here the business logic is separated from the implementation details, and the metadata drives the runtime interpretation and optimization.
So we need to follow these approaches.
First, structured metadata: pipelines are defined through structured metadata describing the transformations, the data quality requirements, and the operational parameters.
As mentioned, the data is transformed from one zone to another zone, and the data quality requirements apply too: when the data is moving from raw to integration, we need to verify whether the data quality is coming up as per the requirements or not.
Those kinds of checks we need to perform.
And for the operational parameters, we also need to give proper settings, for example how to handle the transactions if any real-time data is coming in, and how to handle bulk data; the design should be such that if any surge in load comes, new transformations and new approaches can be handled.
And runtime interpretation: the platform interprets the metadata at runtime to generate and execute the appropriate processing workflows.
So runtime interpretation means using the metadata to generate and execute the appropriate processing workflows.
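To make the structured-metadata and runtime-interpretation idea concrete, here is a minimal sketch of a declarative pipeline definition and a tiny interpreter that executes it; the step names, quality rules, and parameters are illustrative assumptions, not the actual platform metadata.

```python
# Hypothetical declarative pipeline metadata plus a tiny runtime interpreter.
PIPELINE_METADATA = {
    "name": "landing_to_integration",
    "steps": [
        {"op": "filter_nulls", "column": "customer_id"},
        {"op": "rename", "mapping": {"amt": "amount"}},
    ],
    "quality_checks": [{"check": "min_row_count", "threshold": 1}],
    "operational": {"retries": 3, "schedule": "daily"},
}

def filter_nulls(rows, column):
    return [r for r in rows if r.get(column) is not None]

def rename(rows, mapping):
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]

OPS = {"filter_nulls": filter_nulls, "rename": rename}

def run(metadata, rows):
    """Interpret the metadata at runtime: execute each step, then the quality checks."""
    for step in metadata["steps"]:
        op = OPS[step["op"]]
        kwargs = {k: v for k, v in step.items() if k != "op"}
        rows = op(rows, **kwargs)
    for check in metadata["quality_checks"]:
        if check["check"] == "min_row_count" and len(rows) < check["threshold"]:
            raise ValueError("Quality check failed: too few rows")
    return rows

print(run(PIPELINE_METADATA, [{"customer_id": 1, "amt": 10}, {"customer_id": None, "amt": 5}]))
```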
As I mentioned, the different pipelines will be there, and we have an Airflow DAG where we can go and trigger them; we can also trigger the pipelines manually, just like these workflows.
Also, once all development is done and QA is done, make sure that once it is moved to production, those workflows run continuously without any issues, sir. Okay.
Next is the opportunity for parallelization, coaching, and resource optimization, sir.
Here, if around 20 people are working closely, there should not be any breakage or any performance issues.
And on coaching: for any process, whenever a newcomer comes, we need to make sure documentation is correctly maintained for each and every step, so that if any new change comes, we can pick it up from wherever the last step was.
So we should maintain that documentation, and if newcomers are coming, if new people are joining the team, we need to use coaching for them, meaning we need to mentor them.
And for resource optimization: make sure that the people using the project have the correct access, and that we have the correct combination of people working with that access and so on.
So we need to do the resource optimization and make sure of the current configuration of the systems and infrastructure we are using.
We also need to make sure this is part of the implementation and transformation of the architecture.
And this approach makes pipelines more maintainable and less prone to implementation errors, since they focus on business requirements rather than technical details.
So if you maintain all these steps, like structured metadata and runtime interpretation, there are no issues with maintenance, no implementation errors, and the focus stays on the business requirements rather than the technical details.
Integration with CI/CD and DevOps workflows.
Most of the time nowadays, whatever development is done goes straight through CI/CD integration tooling with DevOps support.
Once a developer creates a data pipeline and deploys it into the flow, similar patterns to established application deployment apply, including version control, automated testing, staged deployments, and rollback.
By using continuous integration and deployment we can control this: version control from our previous version to the current and latest version, and automated testing.
Nowadays almost no manual testing is happening; most everywhere, once the code is deployed, we have existing test scenarios and test code in the source, so when our code comes up for the pull request, the test data will be there, the data is inserted, and in the test run automated testing will be done based on those scenarios, with less manual effort.
And for staged deployments: whenever a deployment is done, they wait for approvals; staging means the deployments happen one by one.
And rollback capabilities: if something goes wrong in the integration part, there is the capability to roll back the latest version and keep the old version as the working one, to make sure the business is not impacted.
Git-based workflows have become the de facto standard for managing infrastructure and application code, and platforms must extend these patterns to pipeline definitions and related artifacts.
Yeah.
And continuous integration pipelines for data processing present unique challenges compared to traditional application CI/CD, including extended execution timelines.
Yeah.
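As a small illustration of the automated testing that such CI/CD pipelines run on every pull request, here is a pytest-style check for a pipeline transformation; the transformation and its expectations are hypothetical, not from the talk.

```python
# Hypothetical transformation plus pytest-style tests that CI could run
# on every pull request before the pipeline is promoted.
def standardize_amount(record: dict) -> dict:
    """Convert the amount to cents; drop records with no amount."""
    if record.get("amount") is None:
        return {}
    return {**record, "amount_cents": int(round(record["amount"] * 100))}

def test_standardize_amount_converts_to_cents():
    assert standardize_amount({"id": 1, "amount": 12.34})["amount_cents"] == 1234

def test_standardize_amount_handles_missing_amount():
    assert standardize_amount({"id": 2}) == {}

if __name__ == "__main__":
    # quick local run; CI would invoke pytest instead
    test_standardize_amount_converts_to_cents()
    test_standardize_amount_handles_missing_amount()
    print("all tests passed")
```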
Cloud-native API integration patterns.
That is also one of the things a data platform serves: API integration.
Modern data platforms must integrate seamlessly with cloud-native services while providing consistent abstractions that prevent vendor lock-in.
Okay, here we have three topics I would like to cover: Azure Data Factory integration, Databricks integration, and API design principles.
The first one is Azure Data Factory integration: the platform should simplify Data Factory pipeline creation while preserving access to the advanced features when needed, typically through template-based approaches.
Yeah, and Databricks integration here requires careful consideration of cluster management, notebook deployment, and job orchestration patterns, along with automated provisioning.
So we should give careful consideration to the cluster management, which cluster we are using, the job orchestration we are using, and which notebook we are using; when doing the cloning of those, we should do the cloning correctly and manage it correctly.
Coming to the API design principles: RESTful interfaces with clear resource models and consistent error handling provide a predictable developer experience, and GraphQL implementations can offer more flexible query capabilities.
Here in the API design too, there will be interfaces with a clear resource model and consistent error handling.
When we use API services, most of the time errors and issues come up; since it is cloud-native integration, make sure that whatever configurations and notes we are giving in the requests are clearly defined, so that it won't produce errors.
Yeah.
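As a minimal sketch of the kind of RESTful, consistently error-handled client such a platform API might expose, here is a small wrapper around the requests library; the endpoint, payload fields, and error shape are assumptions for illustration only.

```python
# Hypothetical thin client for a platform's pipeline API, showing
# consistent error handling around a RESTful resource model.
import requests

BASE_URL = "https://data-platform.example.com/api/v1"  # placeholder endpoint

class PlatformError(Exception):
    """Raised in a single, consistent shape for any API failure."""

def create_pipeline(name: str, template: str, parameters: dict) -> dict:
    payload = {"name": name, "template": template, "parameters": parameters}
    try:
        resp = requests.post(f"{BASE_URL}/pipelines", json=payload, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # every failure surfaces to callers the same way
        raise PlatformError(f"pipeline creation failed: {exc}") from exc
    return resp.json()

if __name__ == "__main__":
    pipeline = create_pipeline(
        name="landing_to_integration",
        template="copy_and_validate",   # hypothetical template name
        parameters={"source": "landing", "target": "integration"},
    )
    print(pipeline)
```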
Apache Airflow as orchestration foundation: this is one of the places where Airflow sits.
Apache Airflow serves as the orchestration backbone for many modern data platforms, and as I said, it plays a crucial role due to its Python-native approach and extensive ecosystem integrations.
So with Airflow we can integrate with the latest tools and environments.
And it is Python-native; Python is more and more emerging, and since Airflow takes that Python-native approach, it is actually easy to use.
And the key implementation concerns include multi-tenancy and resource isolation, DAG generation and deployment patterns, custom operator development, and monitoring and alerting integration.
For the DAG generation and deployment patterns and their jobs: based on the development and how many DAGs are required, we need to identify and schedule those patterns, and make sure they run in the correct order.
And monitoring and alerting integration: once a job is triggered, monitoring will be there, and if there is any failure there will be logs; while it is running we can also check the logs to see what the issue is, what has completed, and at what stage it is, so we can monitor them as well.
Rather than requiring users to write Python DAG definitions directly, successful platforms typically provide higher-level abstractions that generate Airflow DAGs from metadata definitions.
So from the metadata definition itself, the generated Airflow DAGs will be created, and we can use those patterns, start using them in our development, and schedule them.
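Here is a minimal sketch of generating an Airflow DAG from a metadata definition, in the spirit described above; the metadata fields and task callables are hypothetical, and it assumes a recent Airflow 2.x installation (2.4+ for the schedule argument).

```python
# Hypothetical metadata-to-DAG generator for Airflow 2.4+.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

PIPELINE_METADATA = {
    "dag_id": "landing_to_integration",
    "schedule": "@daily",
    "tasks": ["extract", "validate", "load"],   # runs in this order
}

def run_step(step_name: str, **_):
    # placeholder: the real platform would dispatch to the actual step logic
    print(f"running step: {step_name}")

def build_dag(metadata: dict) -> DAG:
    dag = DAG(
        dag_id=metadata["dag_id"],
        schedule=metadata["schedule"],
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )
    previous = None
    for name in metadata["tasks"]:
        task = PythonOperator(
            task_id=name,
            python_callable=run_step,
            op_kwargs={"step_name": name},
            dag=dag,
        )
        if previous is not None:
            previous >> task            # chain tasks in metadata order
        previous = task
    return dag

# Airflow discovers DAG objects defined at module level.
generated_dag = build_dag(PIPELINE_METADATA)
```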
Infrastructure as code for ETL development.
Infrastructure-as-code principles transform ETL development from manual, error-prone processes into automated, repeatable procedures that integrate naturally with software development and Git workflows.
And here there are three points I would like to discuss: Terraform modules, Helm charts, and GitOps workflows.
Terraform modules encapsulate common infrastructure patterns for data processing workflows, enabling teams to deploy complex multi-service setups through simple configuration declarations.
And Helm charts provide a Kubernetes-native approach to infrastructure as code that aligns with containerized data processing operations and their deployments.
And coming to GitOps workflows, infrastructure changes are triggered by commits to Git repositories, creating audit trails and enabling sophisticated approval workflows through tools like Argo CD and Flux.
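As a conceptual sketch of the GitOps idea behind tools like Argo CD and Flux, here is a tiny reconcile loop comparing the desired state declared in a repository with the observed state and applying the difference; the state shapes and the apply step are purely illustrative.

```python
# Conceptual GitOps-style reconciliation: desired state (from Git) vs observed state.
desired = {   # what the repository declares
    "pipelines": {"landing_to_integration": {"replicas": 2, "image": "etl-runner:1.4"}},
}
observed = {  # what is currently running
    "pipelines": {"landing_to_integration": {"replicas": 1, "image": "etl-runner:1.3"}},
}

def diff(desired_state: dict, observed_state: dict) -> dict:
    """Return the pipelines whose declared spec differs from what is running."""
    changes = {}
    for name, spec in desired_state["pipelines"].items():
        if observed_state["pipelines"].get(name) != spec:
            changes[name] = spec
    return changes

def apply(changes: dict) -> None:
    # a real controller would call the cluster API here; we only log the intent
    for name, spec in changes.items():
        print(f"reconciling {name} -> {spec}")

apply(diff(desired, observed))
```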
Testing frameworks for data quality: this is one of the critical pieces, data quality assurance.
Data quality testing represents a fundamental requirement for enterprise data platforms, yet traditional testing approaches often prove inadequate for modern data processing workflows.
Yes, a comprehensive testing strategy includes unit testing with data generation, integration testing in container-based environments, schema validation testing for backward compatibility, and performance regression testing with baseline comparisons.
So by using these four levels of testing, we make sure that whatever data is being handled and processed, the data quality will be good.
And unlike traditional software testing, where mock objects can simulate external dependencies, data processing tests often require representative datasets that capture the complexity and edge cases present in the production data.
Yeah.
Production data means the final data, so it should not have any issues or errors in it.
By doing good quality checks, we can avoid errors in the production data.
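As an illustrative example of the schema-validation style of testing from the list above, here is a small backward-compatibility check comparing a new schema against a baseline; the schemas and the rule are assumptions, not the project's actual contract.

```python
# Hypothetical backward-compatibility check: a new schema must keep every
# baseline column with the same type (adding new columns is allowed).
BASELINE_SCHEMA = {"txn_id": "string", "amount": "double", "booked_at": "timestamp"}
NEW_SCHEMA = {"txn_id": "string", "amount": "double", "booked_at": "timestamp",
              "channel": "string"}   # newly added optional column

def breaking_changes(baseline: dict, candidate: dict) -> list:
    problems = []
    for column, col_type in baseline.items():
        if column not in candidate:
            problems.append(f"column removed: {column}")
        elif candidate[column] != col_type:
            problems.append(f"type changed for {column}: {col_type} -> {candidate[column]}")
    return problems

issues = breaking_changes(BASELINE_SCHEMA, NEW_SCHEMA)
if issues:
    raise AssertionError("schema is not backward compatible: " + "; ".join(issues))
print("schema is backward compatible")
```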
Monitoring and observability standards.
There are four ways we are doing the monitoring: distributed tracing, metrics collection, log aggregation, and intelligent alerting.
By using these four, we cover the monitoring and observability standards.
And here, comprehensive monitoring and observability capabilities are essential for maintaining a reliable data platform operating at enterprise scale.
Distributed tracing is critical for understanding complex data processing workflows that span multiple services and processing stages; systems like Jaeger and Zipkin provide detailed execution visibility.
When it comes to metrics collection, time-series databases like InfluxDB provide efficient storage and querying capabilities for both technical performance indicators and business-relevant process outcomes.
Coming to log aggregation, tools like Elasticsearch and Splunk handle high volumes of log output with sophisticated search and analysis capabilities for rapid problem identification.
By using log aggregation with Splunk and Elasticsearch, we can very quickly find where exactly an issue is happening.
And intelligent alerting: sophisticated alert correlation and escalation policies, including ML-based anomaly detection, help surface genuine issues while reducing noise.
Yeah.
Real-world implementation experience: practical implementation of self-service data platforms across diverse enterprise environments provides valuable insights into both architectural patterns and organizational change management strategies.
So here, the industry-specific challenges include: financial services, with compliance, data governance, and risk management requirements; healthcare, with HIPAA compliance, patient privacy, and access controls; and banking, with fraud detection, real-time processing, and core system integrations.
And for airlines, sorry, we need to look at operational resilience and disaster recovery capabilities.
Yeah.
And developer experience and cognitive load considerations.
The success of self-service data platforms ultimately depends on their ability to reduce the load on development teams while providing powerful data processing capabilities.
For CLI tool design: well-designed CLI tools follow established conventions for parameter handling, output formatting, and error reporting, with comprehensive help systems.
So by using such CLI tools we can handle all of this: output handling, formatting, and error reporting.
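As a small sketch of a CLI following those conventions (clear parameters, predictable output, explicit errors, and built-in help), here is an argparse example; the command name and flags are hypothetical.

```python
# Hypothetical platform CLI sketch: conventional flags, --help for free,
# predictable output formats and exit codes.
import argparse
import json
import sys

def main() -> int:
    parser = argparse.ArgumentParser(
        prog="dataplat",
        description="Trigger a data pipeline on the platform.",
    )
    parser.add_argument("pipeline", help="name of the pipeline to trigger")
    parser.add_argument("--env", choices=["dev", "qa", "prod"], default="dev",
                        help="target environment (default: dev)")
    parser.add_argument("--output", choices=["text", "json"], default="text",
                        help="output format")
    args = parser.parse_args()

    result = {"pipeline": args.pipeline, "env": args.env, "status": "triggered"}
    if args.output == "json":
        print(json.dumps(result))
    else:
        print(f"Triggered {args.pipeline} in {args.env}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```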
And web interfaces must provide visualization of complex data processing workflows while enabling sophisticated configuration and monitoring capabilities, sir.
For documentation, interactive documentation systems that combine explanatory content with executable examples provide an effective learning experience.
Governance and compliance in the platform architecture: data lineage, audit logging, access control, and privacy protection.
As I mentioned, data governance and compliance are very key for the company; to speak openly, the stock market and everything else will be based on that compliance, so there should be no data leak, and all the standards should be maintained so that the business keeps running.
All the standards will be in place and auditing will be there, year on year.
Data lineage automatically captures data flow information as pipelines execute, maintaining detailed records of sources and destinations for compliance reporting.
And access controls integrate with the enterprise identity systems while providing fine-grained permissions through RBAC systems for dynamic authorizations.
Audit logging captures comprehensive records of all user actions, data access, and configuration changes, with sufficient detail for compliance reporting.
And privacy protection, sorry, addresses regulatory requirements like GDPR and CCPA through automated data discovery, classification, anonymization, and deletion capabilities.
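To illustrate the RBAC-plus-audit-logging pattern described here, a minimal sketch follows; the roles, permissions, and log format are hypothetical, not the enterprise identity systems referenced in the talk.

```python
# Hypothetical role-based access check that writes an audit record for every decision.
import json
from datetime import datetime, timezone

ROLE_PERMISSIONS = {
    "analyst":  {"read_curated"},
    "engineer": {"read_curated", "run_pipeline"},
    "admin":    {"read_curated", "run_pipeline", "manage_access"},
}

AUDIT_LOG = []   # a real platform would ship these records to an audit store

def authorize(user: str, role: str, action: str) -> bool:
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    AUDIT_LOG.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "allowed": allowed,
    })
    return allowed

print(authorize("alice", "analyst", "run_pipeline"))   # False, and audited
print(authorize("bob", "engineer", "run_pipeline"))    # True, and audited
print(json.dumps(AUDIT_LOG, indent=2))
```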
So data governance and compliance are the key for the company, through these four things the architecture offers: data lineage, access control, audit logging, and privacy protection.
And this is the future directions and conclusion part of this talk on the data transformation journey.
The emerging patterns here are serverless computing for more efficient resource utilization, machine learning integration with model lifecycle management, real-time processing capabilities with stream processing frameworks, edge computing scenarios spanning cloud and edge environments, and AI-powered platform operations for optimization and monitoring.
These are all the patterns and new technologies for doing data handling and data transformation from the legacy to the new, leading to success.
Yeah.
Building effective self-service data platforms requires a holistic approach that balances technical sophistication and user experience considerations.
The most successful implementations treat platform development as product development, with clear user personas, iterative implementation cycles, and comprehensive feedback mechanisms.
And operationally, organizations that successfully implement comprehensive data platform strategies gain a significant competitive advantage through improved developer productivity, faster time to market, and more reliable operations.
These are all the keys to success, actually.
Yeah.
Thank you.
And thank you for giving me the opportunity to present myself and my work and to give my thoughts on it; I'm happy to share my experience and this presentation here.
Thank you very much.