Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello all.
I'm Nata Kala.
As the presentation title states, this is Building Resilient Machine Learning Cloud Integrations: a practical guide to secure, scalable, and cross-platform architectures.
Today we'll walk through a practical, robust guide to the machine learning infrastructures we have in place.
The very first slide talks about the multi-cloud reality.
What exactly is multi-cloud?
Machine learning is evolving rapidly in today's world, as we all know, and to deploy machine learning solutions, companies are widely using three or more cloud services.
It could be two or more, it could be one or more, but overall I would call it multi-cloud.
Using these multi-cloud solutions or cloud deployment models definitely has advantages.
The very first advantage is that we are able to get over vendor lock-in.
So vendor lock-in is no longer happening.
That's great.
But what are the cons of incorporating multiple cloud services?
First, operational efficiency: how are we going to operate the instances across all these clouds?
We will also run into issues around compliance, data security, and privacy.
Overall cloud spending is going to cross 1 trillion dollars by 2028.
Just imagine that 10,000 companies today are investing in multiple clouds; 10,000 times three is already 30,000 cloud footprints, and each of them pays depending on the scale of the solutions they have in place.
That is a lot, to be honest.
And when we are spending that much, I think we have to make sure that whatever we spend is worth it, that it is deployed in the right way and the architecture is laid out in the correct manner.
Overall, today's percentage is crossing 75%; to be more precise, it's 78%.
So many ML deployments are taking place across multiple clouds and not through a single cloud.
Let's move on.
Let's take a look at cloud deployment models.
What models do we have in place?
We have infrastructure, platform, and software as a service.
All these cloud models are widely used for different purposes.
Infrastructure as a service means you get virtual machines and networks, with complete control over the machine learning environment, so it is best for custom frameworks and specialized workloads.
One of the major examples in this area is AWS EC2, which gives you the raw compute for machine learning solutions.
For platform as a service, AWS SageMaker is a good example, along with IBM Watson Studio, Azure ML, and Google Vertex AI.
This is very useful because it lets development teams focus on model creation.
Most of the ML frameworks come managed, so we don't really have to worry about building something from scratch, because built-in scaling is already enabled for us.
Software as a service requires minimal configuration, so it is perfect for specific use cases.
For example, AWS Rekognition, Azure Cognitive Services, and Google Vision AI provide very good services in this area.
Minimal configuration means you can just drag and drop most of the components, which are out of the box in nature and already built.
Next, ML API integration approaches.
Today there are several approaches that we follow.
One is the REST API, which is very robust: it can easily be consumed by mobile clients and it's super lightweight.
It is a stateless architecture, and all the endpoints or callouts are made to HTTP-based endpoints, typically over HTTPS, so it is secure.
JSON and XML responses are both accepted, which means it can work with a wide range of architectures and integrate very easily with third-party APIs and third-party frameworks as well.
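Here's a minimal sketch of what such a REST prediction call looks like; the endpoint URL, token, and payload schema are hypothetical placeholders, not a real service.

```python
# Minimal sketch of calling an ML prediction REST endpoint.
# The URL, token, and payload schema below are placeholders, not a real service.
import requests

API_URL = "https://ml.example.com/v1/predict"   # hypothetical endpoint
API_TOKEN = "replace-with-your-token"           # hypothetical credential

def predict(features: dict) -> dict:
    """Send one stateless prediction request and return the JSON response."""
    response = requests.post(
        API_URL,
        json={"features": features},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=5,  # keep latency bounded for the caller
    )
    response.raise_for_status()
    return response.json()

print(predict({"age": 42, "plan": "premium"}))
```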
Next, GraphQL.
GraphQL is definitely growing.
It follows principles similar to REST, and flexible data queries are available in GraphQL, so data retrieval is very precise; it is also stateless, and we still have a single endpoint.
The big advantage of GraphQL is that it reduces over-fetching: you don't have to over-fetch the data, because it gives you exactly the data you need, which keeps the machine learning pipeline fed with just the right inputs.
GraphQL has also recently been integrated with Salesforce, which is a software-as-a-service platform.
It is mostly used on the JavaScript side of things, and it is very robust in connecting to the server and getting all the data we need.
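As a rough sketch, assuming a hypothetical prediction schema, a GraphQL request that asks only for the fields a consumer needs could look like this:

```python
# Sketch of a GraphQL request that fetches only the fields we will use.
# The endpoint and schema (predictions, score, label) are hypothetical.
import requests

GRAPHQL_URL = "https://ml.example.com/graphql"  # single endpoint for all queries

QUERY = """
query RecentPredictions($limit: Int!) {
  predictions(limit: $limit) {
    score   # only the fields the caller needs, no over-fetching
    label
  }
}
"""

response = requests.post(
    GRAPHQL_URL,
    json={"query": QUERY, "variables": {"limit": 10}},
    timeout=5,
)
response.raise_for_status()
print(response.json()["data"]["predictions"])
```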
Next slide: gRPC.
It is a high-performance, binary communication integration approach, and it mostly helps us with the streaming support that we need.
The latency is also very low, making it very advantageous for handling large volumes of data.
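Here's a minimal sketch of a gRPC streaming client, assuming prediction_pb2 and prediction_pb2_grpc were generated from a hypothetical prediction.proto with a server-streaming Predict RPC; the service and field names are illustrative, not a published API.

```python
# Sketch of a gRPC streaming client; the generated stub and message types are
# assumed to come from a hypothetical prediction.proto.
import grpc
import prediction_pb2
import prediction_pb2_grpc

def stream_predictions(host: str = "ml.example.com:443"):
    # A TLS channel keeps the binary protocol encrypted in transit.
    with grpc.secure_channel(host, grpc.ssl_channel_credentials()) as channel:
        stub = prediction_pb2_grpc.PredictionServiceStub(channel)
        request = prediction_pb2.PredictRequest(model="fraud-v3")
        # Server streaming: results arrive as they are computed, with low latency.
        for result in stub.Predict(request):
            yield result.score

for score in stream_predictions():
    print(score)
```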
And last but not least is event-driven integration.
This is very much required to make sure asynchronous machine learning processing happens at scale, because it handles super large data volumes, millions and billions of records.
One of the major examples in this area is the publish-subscribe pattern with queue-based processing, which is asynchronous in nature.
We also have serverless triggers, which is even more advantageous.
So by using the right API approach, we can definitely improve the efficiency of the overall model.
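As a sketch of queue-based, publish-subscribe style processing, here's how it might look with Amazon SQS through boto3; the queue URL and the score() helper are assumptions for illustration.

```python
# Sketch of queue-based (publish-subscribe style) asynchronous inference with SQS.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ml-inference"  # placeholder

def score(features: dict) -> float:
    """Stand-in for a real model call."""
    return 0.5

def publish(features: dict) -> None:
    # Producers push work onto the queue and return immediately (asynchronous).
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(features))

def consume_forever() -> None:
    # A worker drains the queue at its own pace, which absorbs traffic spikes.
    while True:
        batch = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                    WaitTimeSeconds=20)
        for msg in batch.get("Messages", []):
            print(score(json.loads(msg["Body"])))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```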
Next, identity and access management.
We were talking about security breaches and potential issues around identity.
This model is shown here as a pyramid, and if you look at it from bottom to top, the top is least-privilege access.
That means you minimize all permissions to only what's absolutely necessary.
So least-privilege access is most preferred, along with multi-factor authentication, role-based access, and centralized identity.
All of these factors should be considered when we are designing a model.
Organizations with proper IAM implementations report breach costs that are roughly 50% lower, so we just need to implement it in the right way.
With role-based, multi-factor, and centralized identity, your resellers, for example, can get a centralized identity and multi-factor authentication; but when a guest user tries to access the application, that is where the issue comes from.
We have to manage access depending on the type of user who is trying to access the application.
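Here's a minimal sketch of role-based, least-privilege checks in application code; the roles and permission names are illustrative and not tied to any specific IAM product.

```python
# Role-based, least-privilege access sketch: each role gets only the permissions
# it absolutely needs, and guests get nothing by default.
from functools import wraps

ROLE_PERMISSIONS = {
    "reseller": {"read_predictions"},
    "data_scientist": {"read_predictions", "train_model"},
    "guest": set(),  # least privilege: no access until explicitly granted
}

def requires(permission: str):
    def decorator(func):
        @wraps(func)
        def wrapper(user_role: str, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(user_role, set()):
                raise PermissionError(f"{user_role} may not {permission}")
            return func(user_role, *args, **kwargs)
        return wrapper
    return decorator

@requires("train_model")
def launch_training(user_role: str, dataset: str) -> str:
    return f"training started on {dataset}"

print(launch_training("data_scientist", "claims-2024"))
# launch_training("guest", "claims-2024") would raise PermissionError
```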
And next, the zero-trust security model.
That means do not trust; keep validating every single time you extend trust, whether it is to a person or to another device.
So firstly, verify identity: authenticate every user and service accessing ML resources.
That's very much necessary.
Next, you have to validate the device, as we discussed, so keep checking the device; it could be via any of the authentication practices we discussed on the previous slide.
Then limit access: provide just-in-time, just-enough permissions.
We also have many handlers in place that we can use and incorporate into the models.
And next, monitor activity: continuously do event monitoring to analyze behavior for anomalies, for fraud detection, and for issues related to duplicates.
All these anomalies need to be identified in the right manner, and at every step I would strongly recommend having monitoring in place.
It could be via an interactive dashboard or via reports, but this is very necessary, and most platforms also provide it out of the box, which we can leverage.
I think zero trust definitely eliminates implicit trust from machine learning architectures.
Every request must be verified regardless of source.
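Here's a small sketch of that per-request verification idea using a JWT check before every model call; the signing key, claims, and serve_prediction() helper are assumptions for illustration.

```python
# Zero-trust-style sketch: every call re-validates the caller's token before it
# touches an ML resource; nothing is trusted based on network location alone.
import jwt  # PyJWT

SIGNING_KEY = "replace-with-a-real-secret"  # placeholder; use a KMS-backed key in practice

def verify_request(token: str, required_scope: str = "ml:predict") -> dict:
    """Reject the request unless the token is valid and carries the right scope."""
    claims = jwt.decode(token, SIGNING_KEY, algorithms=["HS256"])  # raises if invalid/expired
    if required_scope not in claims.get("scope", "").split():
        raise PermissionError("token lacks ml:predict scope")
    return claims

def serve_prediction(token: str, features: dict) -> float:
    claims = verify_request(token)  # verified on every request, never cached trust
    print(f"request by {claims.get('sub')} from device {claims.get('device_id')}")
    return 0.5                      # stand-in for the real model call
```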
And training data protection: this is very much vital for us to get rid of the privacy concerns that we especially have in place.
End-to-end encryption is very much needed: we have to make sure all ML data in transit, at rest, and in use is encrypted correctly.
We have to include key rotation, hashing (SHA), and all the other encryption practices, and as much as we can, try to serialize the data and keep it secure via authenticated keys.
Then data sovereignty controls: we have to respect geographical data restrictions through regional storage, and implement cross-border transfer mechanisms whenever necessary.
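As a sketch of encryption at rest with key rotation, here's how it could look with the cryptography package's MultiFernet; the keys are generated inline for illustration, whereas in practice they would come from a KMS or secret store.

```python
# Encrypt training data at rest with support for key rotation.
from cryptography.fernet import Fernet, MultiFernet

current_key = Fernet(Fernet.generate_key())   # newest key, used for new writes
previous_key = Fernet(Fernet.generate_key())  # older key, still valid for reads
keyring = MultiFernet([current_key, previous_key])

record = b'{"patient_id": "123", "label": 1}'  # hypothetical training record

ciphertext = keyring.encrypt(record)            # always encrypted with current_key
plaintext = keyring.decrypt(ciphertext)         # works for data under either key

# Rotation: re-encrypt old ciphertexts under the newest key without losing data.
rotated = keyring.rotate(ciphertext)
assert keyring.decrypt(rotated) == plaintext
```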
Then dataset access monitoring: we have to track all interactions with training data and implement alerting for unusual access patterns that might indicate data leakage.
Data leakage can happen at any point in time if we do not have proper dataset access monitoring techniques laid in place.
And differential privacy: we have to add statistical noise to protect individual records, balancing privacy with the accuracy needs of sensitive ML applications.
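Here's a minimal sketch of that idea using the Laplace mechanism on a simple count query; the epsilon value and the query itself are illustrative, and real deployments would also track a privacy budget.

```python
# Differential privacy sketch: add Laplace noise to a count so no single record
# can be confidently inferred from the released answer.
import numpy as np

def laplace_count(true_count: int, sensitivity: float = 1.0, epsilon: float = 0.5) -> float:
    """Return a noisy count; smaller epsilon means more noise and more privacy."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g. "how many records in the training set have condition X?"
print(laplace_count(true_count=128))
```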
This also indicates that we need to have proper IAM mechanisms in place.
Next, compliance.
We spoke about the compliance issues that today's world is facing.
And why is this happening?
Because with the cross-cloud architecture we have in place, we are facing HIPAA concerns, GDPR compliance, the EU AI Act, PCI, and CCPA.
All these laws carry legal and ethical weight, and all the ML architectures that we are developing should comply with each of them.
GDPR is mostly data subject rights and AI transparency requirements that affect model design.
HIPAA is mostly healthcare; each of us signs a HIPAA consent when we go to a doctor, so all the HIPAA consent practices should be laid into our infrastructure, and the platform should be encrypted.
We can use Shield Platform Encryption and the other encryption techniques available to us to encrypt all the data and control what is and is not available depending on the type of user.
The EU AI Act is a risk-based approach to regulating ML systems by application, and PCI covers secure handling of the financial data used in fraud detection models.
And CCPA gives consumers rights to data access and deletion, which impacts the training process.
All these considerations have to be taken into account.
Next, cost optimization strategies: because we are spending a lot, what are we really doing in terms of optimization?
First, leverage spot instances: that means running non-critical training jobs on discounted compute.
You can defer the computation of non-critical training jobs so that they run on spot capacity.
Next, we have to optimize the storage tiers, meaning if your training data is not accessed frequently, then you have to move it to cold storage.
Then auto-scaling, meaning we have to scale resources based on prediction demand; that's very much vital.
And right-size compute resources: match instance types to ML workload needs.
You just don't need it oversized; you want it at the right size so that you save on all the unused capacity you would otherwise be paying for.
Properly configured ML environments can reduce cloud spending by 30 to 40%, which is a lot.
Regular cost audits should also take place so that we review all the optimizations and can always keep optimizing wherever we are missing anything.
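As a small sketch of the storage-tier idea, here's how an infrequently accessed training dataset could be moved to a colder S3 tier with boto3; the bucket and key are placeholders, and in practice an S3 lifecycle rule would usually do this automatically.

```python
# Move a rarely accessed training dataset to a cheaper S3 storage class.
import boto3

s3 = boto3.client("s3")
BUCKET = "ml-training-data"            # placeholder bucket
KEY = "datasets/claims-2021.parquet"   # placeholder object

# Copy the object onto itself with a colder storage class (Glacier).
s3.copy_object(
    Bucket=BUCKET,
    Key=KEY,
    CopySource={"Bucket": BUCKET, "Key": KEY},
    StorageClass="GLACIER",
    MetadataDirective="COPY",
)
print(f"moved s3://{BUCKET}/{KEY} to cold storage")
```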
Next, cross-platform model serving.
First, containerization: package ML models and dependencies for consistent deployment anywhere.
Docker and Kubernetes are the main examples here, because they already provide platform-agnostic orchestration.
Next, serverless inference: deploy models as functions that automatically scale with demand, and we only pay for actual prediction time, with minimal management overhead.
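Here's a minimal sketch of what such a function could look like, written in the AWS Lambda handler style; the model loading and event shape are placeholders, and other FaaS platforms follow the same pattern.

```python
# Serverless inference sketch in the Lambda handler style.
import json

def load_model():
    """Stand-in for loading a real model artifact from object storage."""
    return lambda features: 0.5

MODEL = load_model()  # loaded once per warm container, reused across invocations

def handler(event, context):
    features = json.loads(event["body"])   # assumes an API-gateway-style JSON body
    score = MODEL(features)
    # You pay only for the milliseconds this function runs; scaling is automatic.
    return {"statusCode": 200, "body": json.dumps({"score": score})}
```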
Lastly in this area, model versioning and performance monitoring.
For performance monitoring, we have to track all the model metrics consistently across multiple cloud providers.
And for versioning, we have to implement controlled releases for safe model updates across platforms.
Next, real-time machine learning integration patterns.
What integration patterns do we have in place?
Synchronous APIs, asynchronous processing, and stream processing.
A synchronous API example is again REST; we can think of it as a direct request-response pattern for immediate predictions.
That means you fetch, wait, get the response, and show it back to the user, and the latency is very low, under a hundred milliseconds.
Asynchronous processing is mostly an event-driven architecture where we use queue-based inference for high-throughput workloads; it is ideal for batch predictions and background tasks, and it scales easily during peak demand.
And then stream processing, which is continuous inference on data streams: you keep streaming the data in and producing results, and consumers can subscribe and receive the real-time data from us.
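As a sketch of stream processing, here's continuous inference over a Kafka topic using kafka-python; the broker address, topic names, and score() helper are assumptions for illustration.

```python
# Stream processing sketch: score each event as it arrives and publish the result
# to an output topic that downstream consumers can subscribe to.
import json
from kafka import KafkaConsumer, KafkaProducer

def score(features: dict) -> float:
    return 0.5  # stand-in for the real model

consumer = KafkaConsumer(
    "raw-events",                                  # hypothetical input topic
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    producer.send("scored-events", {"score": score(message.value)})
```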
Disaster recovery.
This is very much needed.
What exactly is disaster recovery?
Let's say we have a natural calamity or a sudden power interruption; what are we going to do?
Disasters can be even bigger than a power interruption, like I mentioned: it could be a calamity, or it could be a complete system failure altogether.
We have to make sure all the models have these strategies incorporated.
First, model registry replication: that is where you automate registry synchronization and implement cross-region validation, which is very much required.
What if one region fails and another region is still up?
Also version-control all model artifacts; you can use GitHub, Bitbucket, or any of the other source control and version control systems we have in place today.
Then multi-region deployment using global load balancers, which is very much vital for a right model, along with implementing health checks as required and testing regional isolation regularly.
Health checks are part of many software-as-a-service platforms; they have out-of-the-box health checks that we can leverage, but if we do not have one, we have to make sure we build one depending on the org requirement.
And degraded operation modes: we have to develop simplified fallback models, meaning you can build a dead-letter queue and a retry mechanism, which is very important, and also cache recent predictions and create rules-based alternatives.
You just need to make sure how you redirect your queue, or how you are going to redirect your model traffic; that's vital for the machine learning system to keep working in the right way.
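Here's a minimal sketch of such a degraded operation mode, retrying the primary model and then falling back to a cached prediction or a simple rules-based alternative; all the names here are illustrative.

```python
# Degraded-mode sketch: retry the primary model, then serve a cached or
# rules-based prediction so the system keeps answering during an outage.
import time

PREDICTION_CACHE: dict[str, float] = {}   # recent predictions keyed by request id

def primary_model(features: dict) -> float:
    raise ConnectionError("primary region unavailable")  # simulate an outage

def rules_based_fallback(features: dict) -> float:
    return 0.9 if features.get("amount", 0) > 10_000 else 0.1

def predict(request_id: str, features: dict, retries: int = 3) -> float:
    for attempt in range(retries):
        try:
            score = primary_model(features)
            PREDICTION_CACHE[request_id] = score
            return score
        except ConnectionError:
            time.sleep(2 ** attempt)      # exponential backoff between retries
    # Degraded mode: serve a cached result if we have one, else the rules engine.
    return PREDICTION_CACHE.get(request_id, rules_based_fallback(features))

print(predict("req-1", {"amount": 25_000}))
```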
And that represents a cleaner design.
The key takeaways: we have discussed how to engineer resilient cloud integrations through API architectures and IAM protocols, and how to implement comprehensive safeguards for training data while navigating complex compliance requirements across multiple regions.
We also have to architect our systems with scalability built in, and we have to make sure cost optimization strategies are in place.
And there are the disaster recovery mechanisms, which include dead-letter queues and more, to ensure operational continuity.
That's about it for my presentation.
Thank you all for your patience today.
I really enjoyed speaking at this conference.
Thank you all.