Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Today we're going to go through some cloud data engineering aspects. I'm going to walk you through a couple of challenges, a couple of the latest innovations, how cloud data engineering is transforming businesses globally, and some of the benefits businesses typically see when they move to the cloud on the data engineering side of things. Mainly, cloud data engineering is also driving innovation by enabling organizations to manage data efficiently and cost-effectively.
So let's take the multiple platforms we have today: we have AWS, we have GCP, we have Azure. We also have several platforms built on top of these different cloud platforms, like Snowflake and Databricks.
One very good example and use case we could consider is Netflix. Organizations like Netflix exemplify the potential of cloud data engineering by delivering highly personalized content experiences to millions of users worldwide, and they manage their data streams very efficiently by leveraging cloud technologies. You may already know that Netflix's architecture runs almost entirely in the cloud, on AWS. That is something that has helped Netflix maintain its industry leadership by quickly adapting to consumer trends and the latest technology while maintaining a largely serverless kind of architecture. They were able to reduce the amount of investment that has to go into physical hardware and move everything to the cloud. So that is one very good example of how an organization has really transformed by moving away from on-premises hardware and having its data reside completely in the cloud.
So without any delay, let's get started with some of the challenges we typically see with data on a day-to-day basis, what cloud data engineering is all about, and where we stand today globally.
If you look at some of the metrics here, approximately 463 million terabytes of data are generated globally. This data could come from various sources: video, text, audio, streaming content, and any number of others. As you can see, it's 463 million terabytes, and it continues to grow exponentially.
This data explosion challenges traditional data management methods, prompting businesses to seek advanced cloud solutions for effective data handling. The global cloud computing market is predicted to reach around $2.84 trillion by 2030, which, to put it precisely, is only about half a decade away. This highlights the urgent demand for robust data solutions.
With the recent increase in the usage of social media platforms like Twitter, Facebook, Instagram, and TikTok, all of these different platforms widely demonstrate this data explosion. They generate and process vast amounts of data on a daily or even hourly basis, driven by whatever is trending; the data can increase exponentially within an hour or within a day. And that is just one use case of how social media platforms deal with it.
But we also have healthcare industries that rely heavily on cloud technologies to manage patient records and apply AI and big data on top of them. AI has only recently taken off, but I want to emphasize both the AI and the big data side of things: most patient records are managed very efficiently, and there are many predictive analytics applications that help improve patient outcomes and streamline operations.
Now, the other business line we're going to talk about is financial institutions. They also depend on several cloud solutions for real-time fraud detection, risk management, and transaction processing. Take a large-scale organization like Capital One, which has a very large footprint on AWS. So we have gone through three different business lines: healthcare, financial, and social media. And this extends across multiple sectors.
What I have given is only a handful of examples, but it extends to many sectors. This would basically empower organizations to turn data into actionable insights. Maybe they could build some dashboards, build some analytics, or build an application that takes preventive action, making decisions effectively and driving innovation in whatever areas are applicable. That is the main reason why embracing cloud technologies becomes really critical for businesses that seek sustainable growth and operational efficiency in a data-driven organization.
Most organizations in recent times, even the large Fortune 500 companies, are data-driven. I think that's how most large-scale enterprises are expected to operate now, and that's where the whole operational efficiency piece comes into play.
Now let's go to the next slide and look at the infrastructure aspects of it, like how this whole thing started. If we go back to the old days, even within modernization, if I wanted to procure a Linux server, someone had to build a VBlock for it and allocate memory for it; there was human effort technically involved in that situation. It could take anywhere from four weeks to eight weeks for the physical infrastructure to be ready.
When cloud computing started, infrastructure as code revolutionized traditional infrastructure management by providing automated infrastructure provisioning through programmable definitions, with Terraform as one example. Terraform primarily focuses on treating infrastructure as code and automating the builds. The main revolution that infrastructure as code has really brought in is transforming how businesses manage IT resources. Implementing IaC reduces the need to physically deploy something, reduces the amount of time needed to deploy, and minimizes the number of configuration errors or mistakes from human error.
Now, there could be situations where someone writing the IaC makes a programmatic mistake, but that's where we can do peer reviews, run checks against what we are trying to do, and detect all of these issues beforehand. Another example is AWS CloudFormation, which enables enterprises worldwide to automate consistent infrastructure deployments effectively across multiple environments.
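To make that concrete, here is a minimal, hedged sketch of what driving CloudFormation from code can look like, using boto3 from Python. The template file name, parameter, and environment names are illustrative assumptions, not something from the talk.

```python
# Sketch: stamp out the same CloudFormation template into several environments.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("data_platform.yaml") as f:          # assumed template file
    template_body = f.read()

for env in ["dev", "test", "prod"]:
    cfn.create_stack(
        StackName=f"data-platform-{env}",
        TemplateBody=template_body,
        Parameters=[{"ParameterKey": "Environment", "ParameterValue": env}],
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed only if the template creates IAM roles
    )
```

The same idea applies with Terraform or CDK; the point is that every environment is built from one reviewed definition instead of by hand.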
If we go back to the previous slide where I was talking about Netflix: Netflix similarly utilizes IaC to manage thousands of cloud servers globally, because it is not truly possible to keep tabs on servers spread across different regions and zones around the world unless it is truly automated and code-driven, which is what IaC provides. It also significantly enhances disaster recovery capabilities, reduces system downtime, and fosters better collaboration among technical teams.
One good example I can give, at least in our organization, is a use case where most of the infrastructure code is already ready. We would just prepare a JSON definition and submit it, and it would build out a server for us. That's truly the innovation that infrastructure as code has brought in, and it's a real revolution when you go back in time and compare how things were traditionally done with how infrastructure as code has evolved to where it is right now.
Let's talk about modern data pipeline architecture. For a modern data pipeline, we know we have different forms of data and different data formats; it could be CSV, it could be XML, it could be JSON. The very first thing to consider when looking at the data pipeline side of things is that we have to efficiently process these different data formats landing continuously.
So let's go through a couple of examples where I can walk you through data ingestion as well as processing, transformation, storage, storage optimization, and analytics and consumption. Let's take one single platform as an example; I want to pick Snowflake for the purpose of our discussion. When we start developing a Snowflake application, we need an underlying cloud storage platform. It doesn't really matter whether that is AWS, GCP, or Azure, but for our discussion we are going to consider AWS. Say I have five different applications dumping data of different types into an S3 bucket, and I want to build something that processes the data in as close to real time as possible. That is where all of this data would essentially come in, as real time as possible, from diverse sources, with scaling opportunities and capabilities as well.
Now, the data coming through as part of the ingestion has to be processed and transformed, so we would apply whatever business logic or implementation is applicable and run the transformations accordingly.
That is one very good advantage of cloud data engineering solutions: we are not restricted to any fixed compute. We basically get a lot of autoscaling, autoscaling of the warehouse size, autoscaling of the instance size. So we are not truly worried about, okay, I'm going to hit about 50 million users within the next minute, and my on-prem infrastructure isn't going to handle it, so I have to procure another three or four servers and then just leave them sitting idle. In the cloud, the instances autoscale to whatever size is needed, say from extra small to extra large, when the demand increases, and then scale back down. At the end of the day, the main goal is to bring in business efficiency in terms of how we operate, and if scalability bridges the gap between user demand and business efficiency, then cloud data engineering, with AWS as an example, does a great job. That's a real benefit and advantage of the modern data pipeline architecture.
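As a small sketch of what that elasticity can look like on the Snowflake side, assuming a multi-cluster warehouse is available on the account (an Enterprise-edition feature), here is one way to configure it; the warehouse name and limits are placeholders.

```python
# Sketch: let a warehouse scale out under load and suspend itself when idle.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***", role="SYSADMIN",
)
conn.cursor().execute("""
    ALTER WAREHOUSE transform_wh SET
        WAREHOUSE_SIZE = 'XSMALL'     -- start small
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 6         -- scale out when queries start queueing
        SCALING_POLICY = 'STANDARD'
        AUTO_SUSPEND = 60             -- suspend after 60 seconds idle
        AUTO_RESUME = TRUE;
""")
```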
Now we're also going to talk about storage optimization. On on-prem storage systems there is technically no retention period and no cleanup that happens automatically, so typically we end up storing stale data over a long period of time with no controls or restrictions around it. That is one big drawback of having an on-prem system. But with cloud data engineering and a multi-tier architecture, we can have automated data movement based on the access patterns we observe.
The most important thing is the analytics and the consumption of the data. We can take data from different sources, transform it, apply business logic to it, feed it into a Power BI dashboard or whatever analytics dashboard the enterprise plans to use, and then generate analytics out of it. This also reduces the need for writing custom scripts or custom data tooling. If I were to go back in time, someone would have had to write a custom script to generate those charts for analytics, to run SQL queries, or to run any kind of data operations.
Data analytics is going to play a really important and crucial role, especially for financial organizations, as in the use case I mentioned earlier. So it's going to bring a lot of value add.
Now, elastic scalability benefits. I talked about scalability on the previous slide as well, so I'm going to cover this quickly because we already went a little deep. Resource optimization with dynamic allocation ensures the right resource-to-workload matching. There is the Netflix use case I was talking about: the user count can jump at any minute, going from a hundred users to a million users. So resource optimization is the prime benefit of having scalability within a cloud data pipeline. There is also a significant reduction in cost because most of the instances are on demand, so we can technically scale back down to the smallest server possible. And for the most part, all of the servers are highly available, because most of the cloud platforms have a dual-zone deployment setup. For example, if I have a server in US East, I would also have one in US West. If the one in US East goes down, the other is ready to take over and give that seamless experience to the user. So those are mainly the elastic scalability benefits.
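As a hedged illustration of that dual-region idea, the sketch below uses boto3 to register primary and secondary Route 53 failover records, so traffic shifts when the primary health check fails. The zone ID, domain, addresses, and health check ID are placeholders, and this is only one of several ways to do regional failover.

```python
# Sketch: primary record in us-east-1, secondary in us-west-2, switched by a health check.
import boto3

r53 = boto3.client("route53")

for identifier, failover, ip in [("east", "PRIMARY", "203.0.113.10"),
                                 ("west", "SECONDARY", "203.0.113.20")]:
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": failover,
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if failover == "PRIMARY":
        record["HealthCheckId"] = "hc-primary-placeholder"  # assumed health check

    r53.change_resource_record_sets(
        HostedZoneId="Z1234567890",
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )
```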
Now we're going to go through machine learning operations and how MLOps acceleration can be achieved through the cloud data pipeline. In a typical machine learning application lifecycle, we need a model that we develop; the model is continuously improved based on user feedback and user inputs. We also have to do some monitoring around it, to track whether the model is deviating in its responses, and we need the ability to do continuous integration.
Let's take a use case where I just want some output generated through machine learning. What are the different steps we go through? We need a model; it has to be developed, it has to evolve, it has to grow, and it has to get the feedback it needs to operate efficiently. Then we go into the continuous integration aspect, where it is integrated with multiple sources, and we also enable automated testing, which improves the model's efficiency, quality, and reproducibility. Then there is the deployment of the model itself. Typically, as with any other software lifecycle, we have a test environment and a production environment, and the production environment is where we truly roll it out to the users. And then monitoring, of course: how the model is behaving and how it acts with the different data flows that are coming in.
So how does this all tie back to cloud data engineering as a whole? Most of the cloud platforms today have built-in, integrated MLOps capabilities. MLOps is an area that combines machine learning and operations practices together, so it gives a great deal of advantage to folks who are truly interested in MLOps development or in defining the models as such.
Now, as much as everyone would like to move to a serverless architecture with less maintenance, the other aspect that truly plays a role is how secure these cloud environments are. In an on-prem environment we rely on things developed and run in-house, but in the cloud we are depending on a third-party provider. That brings in a very good topic of discussion, which is security.
We have a couple of options we could expand upon for how cloud security can be enforced, made resilient, and made better. One of them is AI-powered threat detection. We can develop machine learning algorithms that identify and neutralize threats with high accuracy, reducing security breaches. Typically, the model or the machine learning algorithm is fed all of the different possible scenarios around the threats that could occur, and the model or application then continuously monitors for threats with as much accuracy as possible, thereby reducing security breaches.
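As an illustration only, here is a tiny anomaly detection sketch with scikit-learn's IsolationForest. The features, the synthetic "normal" data, and the suspicious event are made up for the example and are not a production detection model.

```python
# Sketch: flag an anomalous login/transfer event against a baseline of normal activity.
import numpy as np
from sklearn.ensemble import IsolationForest

# columns: [failed_attempts_last_hour, bytes_transferred_mb, distinct_source_ips]
rng = np.random.default_rng(0)
normal_events = rng.normal(loc=[1, 50, 1], scale=[1, 10, 0.5], size=(1_000, 3))

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_events)

suspicious_event = [[40, 900, 12]]   # burst of failures, large transfer, many source IPs
if detector.predict(suspicious_event)[0] == -1:
    print("anomaly detected -> raise an alert / block the session")
```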
The second option is to follow a zero trust architecture. This is basically very strict identity verification for every user and device, regardless of network location or which resource is being accessed. We have two-factor authentication, multi-factor authentication, and federated authentication (PingFederate, for example), where users get an immediate notification whenever they try to access any network object or application. And every time a user logs in, they are continuously authenticated based on the requirements set by the organization. That is where the zero trust architecture comes into play.
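One small sketch of the "authenticate every request" idea, assuming requests carry a signed token (a JWT here) that is verified and policy-checked on each call. The issuer, audience, and claim names are assumptions, and a real zero trust setup involves far more than token checks.

```python
# Sketch: verify the token and re-check policy on every request, not just at login.
import jwt  # PyJWT

def authorize(request_token: str, public_key: str) -> bool:
    try:
        claims = jwt.decode(
            request_token, public_key,
            algorithms=["RS256"],
            audience="orders-service",              # assumed audience
            issuer="https://idp.example.com",       # assumed identity provider
        )
    except jwt.InvalidTokenError:
        return False
    # Policy check per call: require an MFA claim and the needed scope (assumed names).
    return claims.get("mfa") is True and "orders:read" in claims.get("scope", "")
```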
And then the third aspect is compliance automation. These are intelligent systems that continuously monitor, document, and enforce regulatory requirements across multiple jurisdictions with minimal human intervention. I'll give a classic example of how this plays out in a financial institution. Institutions such as post-trade brokerages or post-event reporting organizations have a very strong need to comply with regulatory laws and regulatory requirements. Often, when we send data back to the regulators, it has to be sent in a continuous feedback loop, and for the most part the regulators will ask how strong the security of the application is, how strong the network is, and things like that. This also falls under compliance automation, where we can build intelligent systems that take care of these checks across the cloud data pipeline.
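A hedged example of one such automated check: using boto3 to flag S3 buckets without default encryption. A real compliance program would cover many more controls and feed the findings into reporting; the bucket set here is just whatever the account contains.

```python
# Sketch: list buckets that lack a default server-side encryption configuration.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        s3.get_bucket_encryption(Bucket=name)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            print(f"non-compliant: {name} has no default encryption")
```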
Now, moving on to the next one, we're going to talk about storage management. With traditional storage, we would just have a solid-state device sitting somewhere on a rack, on a VBlock or something. In the case of the cloud platforms, storage falls into different brackets, and there can be auto-tiering of the data: data can be classified into different tiers, something that changes frequently, something that stays stagnant for a longer period of time, or something that is a hybrid. Most of the modern cloud platforms leverage storage optimization techniques that significantly reduce cost while improving performance. As I was saying, if some data changes pretty frequently and we don't need it after a period of time, we can set up retention policies against it. If we need some data to be available for a very long extended period, let's say two or three years, like the classic Netflix example, that is where we could use Glacier-type storage, where the costs are significantly lower but it is not highly performant. That's one of the catches we have to keep in mind when dealing with long data retention timeframes.
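To illustrate the tiering and retention idea, here is a minimal boto3 sketch that adds a lifecycle rule moving older objects to Glacier and expiring them later. The bucket name, prefix, and day counts are assumptions for the example.

```python
# Sketch: cold data moves to Glacier after 90 days and is deleted after ~3 years.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-then-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 1095},   # roughly three years
    }]},
)
```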
Now, if you look at the graph I'm showing here, the data demonstrates how implementing a comprehensive optimization strategy, combining auto-tiering, compression, and intelligent data placement, can effectively deliver around a 30% cost reduction compared to traditional storage approaches. It also increases data accessibility and performance, as I was stating earlier, depending on the use case and what we're trying to do. It is really important how we structure the data and how we tier it; that all comes into play.
We're going to look at some of the computing advancements, especially the edge computing advancements. Let's look at a couple of aspects here. Before edge computing, data traveled long distances to centralized cloud infrastructure. There used to be very high latency that compromised real-time applications, along with bandwidth constraints, limited scalability, and high transmission costs.
After the edge implementation, data processing happens on the local network. Let's say my data center is sitting in London and my users are in the US. The data has to travel over the internet across network hops to London's network, which can bring in a lot of latency. But if I have a data center located in the same state or the same region, the data stays within the local network perimeter and access is faster. To attest to that, edge deployments can give ultra-fast responses, around 15 milliseconds of latency, which is super low and enables time-critical operations. Let's say I want to do a credit card transaction; credit card transactions cannot take five or six seconds, they have to happen instantly. That is where the edge computing advancements help. Transmission costs are also reduced, and optimized bandwidth utilization improves network efficiency as well.
Yep.
Now, the technology convergence impact. Let's look at three different aspects here: operational excellence, enhanced compliance, and the innovation catalyst. First, up to 90% faster data processing through seamlessly integrated cloud technologies, which enables real-time decision-making capabilities. This is the data pipeline use case I was talking about, where you can take data from different sources, combine it together, run centralized data transformation and whatever application logic you need, and then improve your business operations and business efficiency. The second part is enhanced compliance, which falls under automated governance and regulatory enforcement throughout the entire data lifecycle. And the third one is the innovation catalyst: accelerating the time to market for data-driven products and creating sustainable competitive advantages.
Let's take the example of the social media platforms. Say there are two or three of them already and I want to build a new one. We could use a mix of machine learning and cloud-native architecture and have everything deployed in no time. It accelerates what the business really needs in a very short span. And as far as costs are concerned, we're not paying anything upfront to procure the infrastructure, and we're saving a lot of time as well. Let's say I need about 20 different VBlocks and a whole bunch of network routers and everything; I would end up dealing with a lot of infrastructure teams, waiting, and paying for the equipment even when I'm not really using it. That is where the innovation catalyst comes into play: the time to market for data-driven products and data-driven applications really speeds up because it's all cloud data driven and cloud infrastructure driven.
Now, someone could ask: how do I actually make the move for an application that is sitting in a legacy environment to the cloud? I think it is really important to do an assessment, a very comprehensive analysis of the existing data architecture. How is the data flowing? Is the database on-prem? Are you using a database at all? Do you need some local storage? Do you need a cache? How is the performance looking, and what exactly are we trying to achieve with the current application we have? That comprehensive analysis of the data architecture, the underlying performance gaps, and the high-value optimization opportunities, gathered through stakeholder interviews and system audits, is something that really needs to happen; it serves as the foundation to get started.
Based on that, we move into the strategy development aspects and get some strategy flowing around it. What type of cloud platform do you want to use? Do you want to go with AWS or GCP? Do you want to go with a cloud-managed database? How do we deal with the data, and how would your data pipeline be built? Get that strategy going, get an estimate of the cost your strategy might incur, and also look at the net value you would realize against that cost. Let's say you're spending a million dollars a year today, and you see that your cost could be reduced to 150 grand a year, which is roughly an 85% reduction; then yes, that would be an absolute case for why you would want to move to the cloud. As part of strategy development, design a customized cloud migration roadmap aligned with the business objectives, including the technology, the stakeholders, the governance policies, regulatory compliance, and the implementation timeline.
With the actual implementation itself, we need multiple teams to get involved: cross-functional teams leveraging DevOps. How do we do infrastructure as code? How do we deploy code to AWS? How do we bring instances up and down, and how do we get the regular infrastructure support aspects covered as well? That means DevOps methodologies plus establishing feedback loops for continuous refinement and capability building. And as I said earlier, value realization: what cost effectiveness is moving to the cloud going to bring us?
So that basically summarizes the data transformation journey, and that's pretty much what I had. I hope you found what we covered so far interesting. Please feel free to reach out if you have any questions or need any clarifications.
All right.
Thank you.