Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is bu.
As a DevSecOps and multi-cloud architect, I have spent the last 19 years
helping companies architect, secure, and optimize their cloud infrastructure.
I have had the opportunity to lead some incredible projects,
from innovating AI-based solutions for AIOps to re-engineering high-performance payment systems.
The topic I'm here to discuss today is one I am deeply passionate about: building secure AI platforms,
especially as they become more central to our work these days.
The rapid adoption of AI is not just a technology shift; it's a
fundamental change in how we operate, and with it comes a new set of security challenges
that traditional models simply can't handle.
The race to deploy AI at scale is on, but it's creating some
major security challenges.
We have always relied on a traditional perimeter-based security model,
which assumes that anything inside our network is safe.
That approach is simply not enough anymore, especially when AI
platforms span multiple clouds, edge locations, and hybrid environments.
The old way of thinking, where we built a strong wall around our data
center, is completely inadequate when our data and models are everywhere.
AI workloads also introduce new and unique threats that target the very heart of the technology.
These include model poisoning, where malicious data is used to corrupt a
model during training; data poisoning, which involves injecting bad data
into the training set to manipulate the output; adversarial attacks, which
are specially crafted inputs designed to cause a model to misclassify;
and model extraction attacks, where an attacker reconstructs a proprietary model
by observing its outputs.
The risks are high.
A security incident could lead to compromised models, stolen
intellectual property, and serious regulatory violations.
The stakes have never been higher.
The solution is a fundamental shift to a zero trust security model.
The model operates on the principle of never trust, always verify.
Instead of trusting based on location, we verify every single
transaction and interaction. We will explore how this paradigm shift
can help us build the resilient and secure platforms we need.
As the diagram on the slide suggests, zero trust is a complete
paradigm shift. Instead of the old trust-but-verify mindset, we now must
assume every component, service, and data flow is potentially compromised.
This is a critical distinction.
We are shifting from a reactive stance to a proactive one.
We are not just looking for threats; we are designing our
systems to be secure from scratch.
This approach is built on three core principles.
Verify explicitly: never assume trust based on location or IP
address; every request and every action
must be authenticated and authorized. Use least privilege access: grant only the
minimal permissions necessary for the task.
This dramatically reduces the potential blast radius of a
compromised account or service.
Assume breach: always operate under the assumption that an attacker
is already inside your network.
This forces us to design our systems with containment and rapid response in mind.
AI workloads make this particularly challenging.
A single machine learning pipeline might involve dozens of interconnected
services, from data ingestion systems to feature stores, training
orchestrators, and model registries.
Applying these principles to that kind of complexity is crucial,
and it requires us to think about security in layers.
The first and most critical layer of a zero trust architecture is
identity and access management.
For AI, this goes beyond just authenticating users.
We need to manage the identity of services, models, and even the data itself.
This is because AI workloads often involve a complex
web of service-to-service communication, not
just a person who logged in.
Let's break down three key components here.
Service identity management: each service must possess a cryptographically verifiable identity that
can be validated across all interactions.
This is important as AI workloads span multiple environments.
Mutual TLS authentication: this ensures that every communication between microservices
is encrypted and authenticated.
It's a fundamental way to make sure that only authorized services can talk to each
other, preventing a compromised service from moving freely throughout your network.
Certificate management: AI platforms require automated certificate
rotation, secure key distribution, and revocation
mechanisms that operate at scale.
The challenge intensifies when a training job may scale from a single
node to hundreds of workers within minutes, which requires an identity system
that can handle this dynamic scale.
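To make that concrete, here is a minimal sketch of what mutual TLS enforcement can look like at the application level in Python, assuming each service already holds a certificate and key issued by your internal CA. The file paths, service names, and port are placeholders, not details from this talk.

```python
import ssl
import socket

# Minimal mTLS server context: the service presents its own certificate
# and refuses any client that cannot present one signed by the internal CA.
server_ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
server_ctx.load_cert_chain(certfile="svc-feature-store.pem", keyfile="svc-feature-store.key")
server_ctx.load_verify_locations(cafile="internal-ca.pem")
server_ctx.verify_mode = ssl.CERT_REQUIRED  # reject unauthenticated peers

# Matching client context: the caller authenticates itself and verifies
# that it is really talking to the intended service.
client_ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="internal-ca.pem")
client_ctx.load_cert_chain(certfile="svc-training-job.pem", keyfile="svc-training-job.key")

with socket.create_connection(("feature-store.internal", 8443)) as raw:
    with client_ctx.wrap_socket(raw, server_hostname="feature-store.internal") as tls:
        peer_cert = tls.getpeercert()  # cryptographic identity of the peer service
        print("Connected to:", peer_cert.get("subject"))
```

In practice a service mesh usually handles this for you, but the sketch shows the trust relationship that has to exist either way: both sides present certificates, and both sides verify them against the same internal authority.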
The second layer is network segmentation and microsegmentation.
We need to move beyond simple VLANs to isolate individual workloads, tenant
environments, and data processing stages.
This is about creating secure, isolated zones within your network
regardless of physical location.
This is tricky because AI workloads have unique networking needs, like
high-throughput channels for GPU-to-GPU communication. These requirements
often conflict with traditional network security controls that might
introduce unacceptable latency.
The solution lies in using software-defined networking
that can dynamically create secure communication channels
between authorized services while blocking all unauthorized access.
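As a rough illustration, assuming the AI workloads run on Kubernetes, a NetworkPolicy like the following (expressed here as a Python dict you could serialize to YAML) isolates a training namespace so pods only talk to the services they need; the namespace, labels, and port are hypothetical.

```python
import yaml  # PyYAML, assumed available

# Hypothetical microsegmentation policy: pods labeled app=trainer in the
# ml-training namespace accept ingress only from the orchestrator and send
# egress only to the feature store; everything else is denied.
training_isolation = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "isolate-training", "namespace": "ml-training"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "trainer"}},
        "policyTypes": ["Ingress", "Egress"],
        "ingress": [
            {"from": [{"podSelector": {"matchLabels": {"app": "orchestrator"}}}],
             "ports": [{"protocol": "TCP", "port": 8443}]}
        ],
        "egress": [
            {"to": [{"podSelector": {"matchLabels": {"app": "feature-store"}}}],
             "ports": [{"protocol": "TCP", "port": 8443}]}
        ],
    },
}

print(yaml.safe_dump(training_isolation, sort_keys=False))
```

The point of the sketch is the default-deny posture: anything not explicitly listed in the ingress and egress rules is blocked, which is exactly the microsegmentation behavior described above.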
The third layer is data security and model protection.
This is arguably the most critical aspect of AI security.
A model is only as trustworthy as the data it was trained on,
and compromised data can result in biased, unreliable, or
maliciously manipulated models.
This layer focuses on protecting the lifeblood of AI systems:
the data and the models themselves.
Let's look at three essential components.
Data lineage tracking: it is essential that every piece of training data be traced back to its
source, with cryptographic verification of its integrity throughout the pipeline.
This is our primary defense against data poisoning attacks.
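Here is a minimal sketch of what that cryptographic lineage check could look like in Python, assuming each training batch is a file and provenance is recorded in a simple manifest. The field names and manifest format are illustrative, not from the talk.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(path: str) -> str:
    """Return the SHA-256 digest of a training data file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_lineage(path: str, source: str, manifest: str = "lineage.jsonl") -> None:
    """Append a lineage record linking a data file to its source and digest."""
    entry = {
        "file": path,
        "source": source,
        "sha256": fingerprint(path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(manifest, "a") as f:
        f.write(json.dumps(entry) + "\n")

def verify_lineage(path: str, expected_sha256: str) -> bool:
    """Fail the pipeline stage if the data no longer matches its recorded digest."""
    return fingerprint(path) == expected_sha256
```

Each pipeline stage verifies the digest recorded by the previous stage before consuming the data, so silent tampering anywhere along the path is caught rather than trained on.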
Feature store security: centralized repositories of machine learning features require sophisticated
access controls that understand the sensitivity of different data types.
We need fine-grained permissions at the feature level to prevent unauthorized
access and exfiltration.
Model-specific attack protection: this involves defenses against model extraction attacks
that attempt to steal intellectual property, and adversarial attacks designed
to cause model misclassification.
Additionally, while standard encryption is a must for AI workloads, we also need
to consider more advanced methods like homomorphic encryption and secure
multi-party computation to protect data while it is being computed on.
These methods are complex, but they offer a high degree of protection
for sensitive computations.
Now let's combine the last two layers: runtime security and observability.
Securing AI platforms at runtime demands a different approach.
AI workloads often run for extended periods, consume
significant resources, and require
privileged access to specialized hardware like GPUs, which increases
the potential for a breach.
Our monitoring must accurately differentiate between legitimate
behavior, like bursty resource consumption, and subtle
signs of a security breach.
This leads us to the final layer: comprehensive
monitoring and observability.
We need to track data quality, model performance, and infrastructure
health simultaneously.
This means using AI-specific monitoring that tracks metrics indicating potential security compromises, such
as model accuracy degradation and unusual query patterns. Robust log aggregation:
we need to process extensive logging output from training jobs,
high-volume access logs from model serving systems, and specialized
GPU performance metrics, all at scale.
AI-specific incident response: we must have specific runbooks for AI scenarios.
These include procedures for model rollbacks, for responding to a data breach
that involves compromised training data, and for automated
quarantine of compromised workloads.
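As a simple sketch of the kind of AI-specific check this implies, one could flag both accuracy degradation and unusual per-client query volume as shown below; the thresholds, metric names, and log format are made up for illustration.

```python
from collections import Counter

def accuracy_degraded(baseline_acc: float, current_acc: float, tolerance: float = 0.05) -> bool:
    """Flag a potential poisoning or drift incident when accuracy drops past tolerance."""
    return (baseline_acc - current_acc) > tolerance

def suspicious_clients(query_log: list[dict], max_queries_per_hour: int = 5000) -> list[str]:
    """Flag clients whose query volume looks like model-extraction probing."""
    counts = Counter(entry["client_id"] for entry in query_log)
    return [client for client, n in counts.items() if n > max_queries_per_hour]

# Example: feed these signals into the alerting pipeline.
if accuracy_degraded(baseline_acc=0.92, current_acc=0.83):
    print("ALERT: model accuracy degradation beyond tolerance")
for client in suspicious_clients([{"client_id": "svc-a"}] * 6000):
    print(f"ALERT: unusual query volume from {client}")
```

Real detectors would be statistical rather than fixed thresholds, but the shape is the same: security signals derived from model behavior, not just from infrastructure metrics.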
So where do you start?
A phased, incremental approach is key.
You can't do everything at once.
Service identity management: begin here
by implementing mutual TLS authentication between AI services and establishing
a certificate management process. This provides immediate security benefits.
Network segmentation: next, move to network segmentation.
Start with broad network policies that isolate different types of
AI workloads, and then gradually implement more sophisticated microsegmentation
as you learn more about your traffic patterns. Data security controls: for
data security, begin with data classification and lineage tracking
before you implement more complex controls like homomorphic encryption
or secure multi-party computation.
Remember, this is not just a technical challenge, it's a cultural one.
You have to work closely with data scientists, who may view security controls
as an obstacle to rapid innovation.
The goal is to design security controls that enhance, not
hinder, their productivity.
Now let's talk about the specific technologies and tools that are
essential for building a secure AI platform based on the zero trust model.
The technology choices you make significantly impact both
your security effectiveness and your operational complexity.
Platform engineers must evaluate tools for their compatibility with
AI-specific requirements like GPU scheduling, high-performance networking,
and specialized storage systems.
First, consider service mesh technologies.
Tools like Istio or AWS App Mesh provide critical capabilities for service-to-service
authentication and encryption.
They are the backbone of a zero trust network, allowing you to enforce
granular security policies without changing the application code itself.
However, it's vital to test them against your workloads
to ensure they don't introduce unacceptable latency
that could cripple performance.
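For example, with Istio, a mesh-wide policy along the lines of the following (shown as a Python dict for consistency with the other sketches) enforces strict mTLS without touching application code. This mirrors a standard Istio PeerAuthentication resource, but verify the exact fields against the Istio version you run.

```python
import yaml  # PyYAML, assumed available

# Mesh-wide strict mTLS: any plaintext service-to-service call is rejected.
strict_mtls = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "PeerAuthentication",
    "metadata": {"name": "default", "namespace": "istio-system"},
    "spec": {"mtls": {"mode": "STRICT"}},
}

print(yaml.safe_dump(strict_mtls, sort_keys=False))
```

Because the sidecars terminate and originate the TLS, the training and serving code keeps speaking plain HTTP internally, which is what makes this approach attractive for existing AI services.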
Next, we have container orchestration.
Kubernetes is the de facto standard, but for AI, your clusters often
require specialized node pools with GPU resources, high-bandwidth networking,
and large-scale storage attachments.
These are not your typical web server workloads, and your orchestration platform
must be able to manage this specialized hardware effectively.
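To illustrate what those specialized workloads look like to the scheduler, a training pod might request GPUs and be pinned to a dedicated node pool roughly like this; the labels, taints, and image name are hypothetical, and the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed.

```python
import yaml  # PyYAML, assumed available

training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer-0", "namespace": "ml-training"},
    "spec": {
        # Pin to the GPU node pool and tolerate its dedicated taint.
        "nodeSelector": {"pool": "gpu-a100"},
        "tolerations": [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}],
        "containers": [{
            "name": "trainer",
            "image": "registry.internal/ml/trainer:1.0",
            "resources": {
                "limits": {"nvidia.com/gpu": 4, "memory": "64Gi", "cpu": "16"},
            },
        }],
    },
}

print(yaml.safe_dump(training_pod, sort_keys=False))
```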
Finally, you need robust observability platforms.
These platforms must be able to handle the unique monitoring requirements of AI. This
goes beyond standard CPU and memory metrics to include things like
GPU utilization, training loss curves, and model serving latency distributions.
Without this level of detail, you can't accurately monitor the
health and security of AI systems.
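Here is a tiny sketch of exporting those AI-specific metrics with the Prometheus Python client, assuming prometheus_client is installed; how you actually read GPU utilization (for example via NVML) is left out, and random values stand in for real readings.

```python
from prometheus_client import Gauge, Histogram, start_http_server
import random
import time

# AI-specific metrics beyond plain CPU/memory.
gpu_utilization = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])
training_loss = Gauge("training_loss", "Current training loss", ["job"])
serving_latency = Histogram("model_serving_latency_seconds", "Model serving latency")

start_http_server(9100)  # expose /metrics for scraping

while True:
    gpu_utilization.labels(gpu="0").set(random.uniform(40, 100))  # stand-in for NVML readings
    training_loss.labels(job="demo").set(random.uniform(0.1, 1.0))
    with serving_latency.time():  # would wrap a real inference call
        time.sleep(random.uniform(0.01, 0.05))
    time.sleep(5)
```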
A major concern with zero trust is the potential performance impact, as
AI workloads are incredibly sensitive to latency and throughput degradation.
But there are ways to optimize. Encryption overhead:
select algorithms optimized for high-throughput scenarios
and leverage the hardware-accelerated encryption capabilities in modern
processors and network cards.
This moves the heavy lifting from software to specialized hardware.
Authorization latency: cache authentication tokens and use
efficient authorization engines.
You can also consider authentication delegation, where a trusted proxy service handles authentication
for high-frequency operations and reduces the load on your services.
Network policy impact: benchmark your policy implementation with representative
workloads to find bottlenecks.
Consider a graduated security model where performance-sensitive communication
receives streamlined processing. Monitoring overhead: use sampling strategies and
asynchronous logging to maintain visibility without slowing down your systems.
It's about getting the right data, not all the data, and processing it in a way that
doesn't bottleneck your operations.
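As one concrete example of the authorization-latency point, a small time-bounded cache of authorization decisions keeps high-frequency calls from hitting the policy engine every time. The check_with_policy_engine function below is a hypothetical stand-in for whatever engine you actually use.

```python
import time

_CACHE: dict[tuple[str, str, str], tuple[bool, float]] = {}
TTL_SECONDS = 30  # short TTL keeps the revocation delay bounded

def check_with_policy_engine(principal: str, action: str, resource: str) -> bool:
    """Stand-in for a real (typically remote) call to your authorization engine."""
    return principal.startswith("svc-") and action == "read"

def is_authorized(principal: str, action: str, resource: str) -> bool:
    key = (principal, action, resource)
    cached = _CACHE.get(key)
    if cached is not None and time.monotonic() - cached[1] < TTL_SECONDS:
        return cached[0]  # serve the recent decision without a remote call
    decision = check_with_policy_engine(principal, action, resource)
    _CACHE[key] = (decision, time.monotonic())
    return decision

# High-frequency callers pay the policy-engine round trip only once per TTL window.
print(is_authorized("svc-training-job", "read", "feature-store/users"))
```

The trade-off is explicit: a longer TTL means lower latency but a longer window before a revoked permission takes effect, which is why the TTL should stay short for sensitive resources.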
As AI platforms become more integrated into our businesses,
they also fall under a growing number of complex regulatory frameworks.
We are not just talking about traditional regulations like GDPR and HIPAA anymore;
we also have to deal with new, emerging AI-specific legislation.
These frameworks require security controls that not only offer technical protection,
but also provide a clear path to compliance verification.
Some of the key challenges we face in this area include
data residency requirements for global platforms and the need for
region-specific security controls.
The zero trust approach helps us address these challenges, but
it also introduces new complexities.
For instance, audit logging becomes even more critical in AI environments,
where training processes may run for extended periods and generate massive amounts of operational
data; we need to be able to process and analyze that data at scale.
Furthermore, new AI regulations are introducing requirements for model
explainability and transparency.
This means our security monitoring systems must be able to provide
detailed visibility into a model's decision-making process, all
while protecting sensitive data.
It's a delicate balance between providing transparency and
maintaining confidentiality.
Looking ahead, the intersection of zero trust and AI security is evolving at a rapid pace.
To stay ahead of emerging threats, platform engineers need to
keep an eye on a few key trends.
First, there is confidential computing.
This technology uses hardware-based security capabilities
to enable secure model training and inference on
untrusted infrastructure.
It's a game changer for data protection and workload isolation,
as it fundamentally changes how we approach these problems.
Next, we have federated learning.
This approach reduces data exposure
by keeping training data decentralized, but it creates a new challenge:
managing the security of complex distributed systems that require
sophisticated security orchestration.
We are also seeing the rise of AI-powered security. Emerging tools can
help automate zero trust policy enforcement and incident response.
For AI platforms, this is crucial for handling the scale and complexity of
modern systems, and it helps reduce the operational burden on our teams.
Finally, a more long-term consideration is post-quantum cryptography.
While we may be years away from practical quantum threats,
it is never too early for platform engineers to begin considering post-quantum
implementations to ensure the long-term security of AI systems.
In summary, implementing zero trust for AI platforms represents a fundamental
shift in how organizations approach infrastructure security.
We can't rely on old security models;
we must build resiliency into the infrastructure from the ground up.
The unique characteristics of AI workloads require these specialized approaches.
Success requires close collaboration between platform engineering,
security, and AI development teams.
Security cannot be an afterthought;
it must be designed into the platform architecture from the
ground up. The investment pays off in many ways.
You gain better visibility into your AI operations,
your compliance posture improves, and you achieve more reliable system performance.
The key is to focus on incremental improvement rather than wholesale transformation.
Each security control you add provides immediate value while building towards
a more comprehensive architecture.
Looking ahead, we'll see emerging technologies like confidential computing and federated
learning fundamentally change how we think about data protection, and AI-powered
security tools will help us automate policy enforcement at scale.
Okay.
Thank you for listening.
If you have any questions, you can reach out to me on LinkedIn or through email.