Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Building Secure AI Platform Infrastructure: A Zero Trust Framework for Reducing Security Incidents by 73% in Cloud-Native Environments


Abstract

AI platforms are under siege—73% fewer breaches with Zero Trust! Live demos of battle-tested frameworks from 200+ deployments. Stop model poisoning, secure your ML pipelines, and sleep better at night. Real code, real results, zero fluff.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Sudheer Obbu. As a DevSecOps and multi-cloud architect, I have spent the last 19 years helping companies architect, secure, and optimize their cloud infrastructure. I have had the opportunity to lead some incredible projects, from building innovative solutions for AIOps to re-engineering high-performance payment systems. The topic I'm here to discuss today is one I am deeply passionate about: building secure AI platforms, especially as they become more central to our work. The rapid adoption of AI is not just a technology shift, it's a fundamental change in how we operate, and with it comes a new set of security challenges that traditional models simply can't handle.

The race to deploy AI at scale is on, but it is creating some major security challenges. We have always relied on a traditional perimeter-based security model, which assumes that anything inside our network is safe. That approach is simply not enough anymore, especially when AI platforms span multiple clouds, edge locations, and hybrid environments. The old way of thinking, where we built a strong wall around our data center, is completely inadequate when our data and models are everywhere.

AI workloads also introduce new and unique threats that target the very heart of the technology. These include model poisoning, where malicious data is used to corrupt a model during training; data poisoning, which involves injecting bad data into the training set to manipulate the output; adversarial attacks, which are specially crafted inputs designed to cause a model to misclassify; and model extraction attacks, where an attacker reconstructs a proprietary model by observing its outputs. The risks are high: a security incident could lead to compromised models, stolen intellectual property, and serious regulatory violations. The stakes have never been higher.

The solution is a fundamental shift to a zero trust security model. This model operates on the principle of never trust, always verify. Instead of trusting based on location, we verify every single transaction and interaction. We will explore how this paradigm shift can help us build the resilient and secure platforms that we need.

As the diagram on the slide suggests, zero trust is a complete paradigm shift. Instead of the old "trust, but verify" mindset, we now must assume every component, service, and data flow is potentially compromised. This is a critical distinction: we are shifting from a reactive stance to a proactive one. We are not just looking for threats; we are designing our systems to be secure from the start.

This approach is built on three core principles. Verify explicitly: never assume trust based on location or IP address; every request and every action must be authenticated and authorized. Use least-privilege access: grant only the minimal permissions necessary for the task, which dramatically reduces the potential blast radius of a compromised account or service. Assume breach: always operate under the assumption that an attacker is already inside your network, which forces us to design our systems with containment and rapid response in mind.

AI workloads make this particularly challenging. A single machine learning pipeline might involve dozens of interconnected services, from data ingestion systems to feature stores, training orchestrators, and model registries. Applying these principles to that kind of complexity is crucial, and it requires us to think about security in layers.
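Before walking through those layers, here is a minimal sketch of what "verify explicitly" and least privilege can look like for service-to-service calls in an ML pipeline. It is illustrative only: the service identities, actions, and the `authorize_request` helper are hypothetical placeholders, not part of any specific product.

```python
# Hypothetical sketch: explicit, per-request authorization for AI pipeline services.
# Every caller identity and action is checked; nothing is trusted by network location.
from dataclasses import dataclass

# Least-privilege policy: each service identity gets only the actions it needs.
ALLOWED_ACTIONS = {
    "spiffe://ml-platform/feature-store-reader": {"features:read"},
    "spiffe://ml-platform/training-orchestrator": {"features:read", "models:write"},
    "spiffe://ml-platform/model-server": {"models:read"},
}

@dataclass
class Request:
    caller_identity: str   # verified from the caller's mTLS certificate, not its IP
    action: str            # e.g. "models:write"
    resource: str          # e.g. "models/fraud-detector/v7"

def authorize_request(req: Request) -> bool:
    """Verify explicitly: deny by default, allow only listed identity/action pairs."""
    allowed = ALLOWED_ACTIONS.get(req.caller_identity, set())
    return req.action in allowed

if __name__ == "__main__":
    req = Request("spiffe://ml-platform/model-server", "models:write", "models/fraud-detector/v7")
    print("allowed" if authorize_request(req) else "denied")  # denied: least privilege
```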
The first and most critical layer of a zero trust architecture is identity and access management. For AI, this goes beyond just authenticating users: we need to manage the identity of services, models, and even the data itself, because AI workloads often involve a complex web of service-to-service communication, not just a person who logged in. Let's break down three key components here. Service identity management: each service must possess a cryptographically verifiable identity that can be validated across all interactions, which is important as AI workloads span multiple environments. Mutual TLS authentication: this ensures that every communication between microservices is encrypted and authenticated; it's a fundamental way to make sure that only authorized services can talk to each other, preventing a compromised service from moving freely through your network. Certificate management: AI platforms require automated certificate rotation, secure key distribution, and revocation mechanisms that operate at scale. The challenge intensifies when a training job may scale from a single node to hundreds of workers within minutes, which requires an identity system that can handle this dynamic scale.

The second layer is network segmentation and microsegmentation. We need to move beyond simple VLANs to isolate individual workloads, tenant environments, and data processing stages. This is about creating secure, isolated zones within your network regardless of physical location. It is tricky because AI workloads have unique networking needs, such as high-throughput communication for GPU-to-GPU traffic, and these requirements often conflict with traditional network security controls that might introduce unacceptable latency. The solution lies in software-defined networking that can dynamically create secure communication channels between authorized services while blocking all unauthorized access.

The third layer is data security and model protection. This is arguably the most critical aspect of AI security: a model is only as trustworthy as the data it was trained on, and compromised data can result in biased, unreliable, or maliciously manipulated models. This layer focuses on protecting the lifeblood of AI systems, the data and the models themselves. Let's look at three essential components. Data lineage tracking: every piece of training data should be traceable back to its source, with cryptographic verification of its integrity throughout the pipeline; this is our primary defense against data poisoning attacks. Feature store security: centralized repositories of machine learning features require sophisticated access controls that understand the sensitivity of different data types, with fine-grained permissions at the feature level to prevent unauthorized access and exfiltration. Model-specific attack protection: this involves defenses against model extraction attacks that attempt to steal intellectual property and adversarial attacks designed to cause model misclassification. Additionally, while standard encryption is a must for AI workloads, we should also consider more advanced methods like homomorphic encryption and secure multi-party computation to protect data while it is being computed on. These methods are complex, but they offer a high degree of protection for sensitive computations.
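To make the mutual TLS piece above concrete, here is a minimal sketch of a service that only accepts callers presenting a certificate signed by the platform CA. The file paths and port are placeholders; in practice a service mesh or a workload identity system such as SPIFFE/SPIRE would typically handle this rather than hand-rolled sockets.

```python
# Minimal sketch: a server socket that enforces mutual TLS.
# Certificate paths below are placeholders for material issued by your platform CA.
import socket
import ssl

def build_mtls_server_context() -> ssl.SSLContext:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile="model-server.crt", keyfile="model-server.key")
    ctx.load_verify_locations(cafile="platform-ca.crt")
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject callers without a valid client certificate
    return ctx

def serve(host: str = "0.0.0.0", port: int = 8443) -> None:
    ctx = build_mtls_server_context()
    with socket.create_server((host, port)) as sock:
        with ctx.wrap_socket(sock, server_side=True) as tls_sock:
            conn, addr = tls_sock.accept()          # TLS handshake happens here
            peer = conn.getpeercert()               # verified identity of the calling service
            print(f"authenticated connection from {addr}, subject={peer.get('subject')}")
            conn.close()
```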
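Similarly, the data lineage tracking described above can start as simply as recording a content hash for every dataset artifact that enters the pipeline, so later tampering is detectable. This is a simplified, hypothetical sketch; a production system would typically sign these records and keep them in an append-only metadata store.

```python
# Simplified sketch: record and verify the integrity of training data artifacts.
import hashlib
import json
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_lineage(dataset: Path, source: str, ledger: Path) -> dict:
    """Append a lineage record (source + content hash) for a dataset artifact."""
    entry = {"artifact": dataset.name, "source": source, "sha256": sha256_of_file(dataset)}
    with ledger.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def verify_lineage(dataset: Path, ledger: Path) -> bool:
    """Check the artifact's current hash against what was recorded at ingestion time."""
    current = sha256_of_file(dataset)
    for line in ledger.read_text().splitlines():
        entry = json.loads(line)
        if entry["artifact"] == dataset.name:
            return entry["sha256"] == current
    return False  # no record at all is treated as a failure
```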
Now let's combine the last two layers: runtime security and observability. Securing AI platforms at runtime demands a different approach. AI workloads often run for extended periods, consume significant resources, and require privileged access to specialized hardware like GPUs, which increases the potential for a breach. Our monitoring must accurately differentiate between legitimate behavior, like a burst of resource consumption, and subtle signs of a security breach.

This leads us to the final layer: comprehensive monitoring and observability. We need to track data quality, model performance, and infrastructure health simultaneously. That means AI-specific monitoring that tracks metrics indicating potential security compromises, such as model accuracy degradation and unusual query patterns; robust log aggregation that can process extensive logging output from training jobs, high-volume access logs from model serving systems, and specialized GPU performance metrics at scale; and AI-specific incident response, with runbooks for AI scenarios that cover model rollbacks, responding to data breaches involving compromised training data, and automated quarantine of compromised workloads.

So where do you start? A phased, incremental approach is key; you can't do everything at once. Service identity and access management: begin here, by implementing mutual TLS authentication between AI services and establishing the certificate management process; this provides immediate security benefits. Network segmentation: next, move to network segmentation, starting with broad network policies that isolate different types of AI workloads, then gradually implement more sophisticated microsegmentation as you learn more about your traffic patterns. Data security controls: for data security, begin with data classification and lineage tracking before you implement more complex controls like homomorphic encryption or secure multi-party computation. Remember, this is not just a technical challenge, it's a cultural one. You have to work closely with data scientists, who may view security controls as an obstacle to rapid innovation. The goal is to design security controls that enhance, not hinder, their productivity.

Now let's talk about the specific technologies and tools that are essential for building a secure AI platform based on a zero trust model. The technology choices you make significantly impact both your security effectiveness and your operational complexity. Platform engineers must evaluate tools for their compatibility with AI-specific requirements like GPU scheduling, high-performance networking, and specialized storage systems. First, consider service mesh technologies. Tools like Istio or AWS App Mesh provide critical capabilities for service-to-service authentication and encryption. They are the backbone of a zero trust network, allowing you to enforce granular security policies without changing the application code itself. However, it's vital to test your workloads to ensure they don't introduce unacceptable latency that could cripple performance. Next, we have container orchestration. Kubernetes is the de facto standard, but for AI your clusters often require specialized node pools with GPU resources, high-bandwidth networking, and large-scale storage attachments. These are not your typical web server workloads, and your orchestration platform must be able to manage this specialized hardware effectively.
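As an illustration of the monitoring layer described above, the sketch below flags two of the signals mentioned earlier: model accuracy degradation and unusual query volume. The thresholds and window sizes are arbitrary placeholders; a real platform would feed signals like these into whatever alerting stack it already runs.

```python
# Illustrative sketch: simple security-relevant signals for a deployed model.
from collections import deque

class ModelHealthMonitor:
    def __init__(self, baseline_accuracy: float, max_accuracy_drop: float = 0.05,
                 max_queries_per_window: int = 10_000, window: int = 100):
        self.baseline_accuracy = baseline_accuracy
        self.max_accuracy_drop = max_accuracy_drop
        self.max_queries_per_window = max_queries_per_window
        self.recent_correct = deque(maxlen=window)  # rolling accuracy window
        self.queries_in_window = 0

    def record_prediction(self, was_correct: bool) -> None:
        self.recent_correct.append(1 if was_correct else 0)
        self.queries_in_window += 1

    def alerts(self) -> list:
        findings = []
        if len(self.recent_correct) == self.recent_correct.maxlen:
            accuracy = sum(self.recent_correct) / len(self.recent_correct)
            if self.baseline_accuracy - accuracy > self.max_accuracy_drop:
                findings.append(f"accuracy degraded to {accuracy:.2%} (possible poisoning or drift)")
        if self.queries_in_window > self.max_queries_per_window:
            findings.append("unusual query volume (possible model extraction attempt)")
        return findings
```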
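And for the "start with broad network policies" step in the phased rollout, here is one way to express a coarse isolation rule as a Kubernetes NetworkPolicy, generated from Python for illustration. The namespace and label names are made up; the intent is simply "training workloads only accept traffic from their own namespace."

```python
# Illustrative sketch: a coarse isolation NetworkPolicy for a training namespace,
# emitted as YAML so it can be applied with kubectl. Names and labels are placeholders.
import yaml  # PyYAML

training_isolation_policy = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "isolate-training-workloads", "namespace": "ml-training"},
    "spec": {
        "podSelector": {"matchLabels": {"workload-type": "training"}},
        "policyTypes": ["Ingress"],
        "ingress": [
            {   # only pods in the ml-training namespace may reach training pods
                "from": [{"namespaceSelector": {
                    "matchLabels": {"kubernetes.io/metadata.name": "ml-training"}}}]
            }
        ],
    },
}

if __name__ == "__main__":
    print(yaml.safe_dump(training_isolation_policy, sort_keys=False))
```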
Rounding out the tool stack, you also need robust observability platforms. These platforms must be able to handle the unique monitoring requirements of AI. This goes beyond standard CPU and memory metrics to include things like GPU utilization, training loss curves, and model serving latency distributions. Without this level of detail, you can't accurately monitor the health and security of AI systems.

A major concern with zero trust is the potential performance impact, as AI workloads are incredibly sensitive to latency and throughput degradation. But there are ways to optimize. Encryption overhead: select algorithms optimized for high-throughput scenarios and leverage the hardware-accelerated encryption capabilities in modern processors and network cards; this moves the heavy lifting from software to specialized hardware. Authorization latency: cache authentication tokens and use efficient authorization engines; you can also consider authentication delegation, where a trusted proxy service handles authentication for high-frequency operations and reduces the load on your services. Network policy impact: benchmark your policy implementation with representative workloads to find bottlenecks, and consider a graduated security model where performance-sensitive communication receives streamlined processing. Monitoring overhead: use sampling strategies and asynchronous logging to maintain visibility without slowing down your systems. It's about getting the right data, not all the data, and processing it in a way that doesn't bottleneck your operations.
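Here is a small sketch of the token-caching idea from the authorization latency point above. The `fetch_token_from_authority` function is a stand-in for whatever identity provider or authorization engine your platform actually uses.

```python
# Illustrative sketch: cache service-to-service auth tokens with a TTL so that
# high-frequency calls don't pay the full authorization round-trip every time.
import time

class TokenCache:
    def __init__(self, fetch_token, ttl_seconds: float = 300.0):
        self._fetch_token = fetch_token  # callable(service_name) -> token string
        self._ttl = ttl_seconds
        self._cache = {}                 # service_name -> (token, fetched_at)

    def get(self, service_name: str) -> str:
        now = time.monotonic()
        cached = self._cache.get(service_name)
        if cached and now - cached[1] < self._ttl:
            return cached[0]             # still fresh, skip the round-trip
        token = self._fetch_token(service_name)   # e.g. call your identity provider
        self._cache[service_name] = (token, now)
        return token

def fetch_token_from_authority(service_name: str) -> str:
    # Placeholder for a real call to your authorization engine / identity provider.
    return f"token-for-{service_name}-{int(time.time())}"

tokens = TokenCache(fetch_token_from_authority, ttl_seconds=300)
print(tokens.get("feature-store"))   # first call hits the authority
print(tokens.get("feature-store"))   # calls within the TTL are served from cache
```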
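And for the monitoring-overhead point, this sketch combines sampling with asynchronous logging using Python's standard library, so per-request logging doesn't block the serving path. The 10% sample rate is an arbitrary example; denied requests are always kept.

```python
# Illustrative sketch: sampled, asynchronous access logging for a model-serving path.
import logging
import logging.handlers
import queue
import random

log_queue = queue.Queue()
handler = logging.handlers.QueueHandler(log_queue)          # non-blocking for the caller
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()

access_log = logging.getLogger("model.access")
access_log.setLevel(logging.INFO)
access_log.addHandler(handler)

SAMPLE_RATE = 0.10  # keep ~10% of routine, allowed events; always keep denials

def log_access(caller: str, action: str, allowed: bool) -> None:
    if allowed and random.random() > SAMPLE_RATE:
        return  # sampled out: routine, allowed traffic
    access_log.info("caller=%s action=%s allowed=%s", caller, action, allowed)

log_access("training-orchestrator", "models:write", True)
log_access("unknown-service", "models:read", False)
listener.stop()
```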
As AI platforms become more integrated into our businesses, they also fall under a growing number of complex regulatory frameworks. We are not just talking about traditional regulations like GDPR and HIPAA anymore; we also have to deal with new, emerging AI-specific legislation. These frameworks require security controls that not only offer technical protection but also provide a clear path to compliance verification. Some of the key challenges we face in this area include data residency requirements for global platforms and the need for region-specific security controls. The zero trust approach helps us address these challenges, but it also introduces new complexities. For instance, audit logging becomes even more critical in AI environments, where training processes may run for extended periods and generate massive amounts of operational data that we need to be able to process and analyze at scale. Furthermore, new AI regulations are introducing requirements for model explainability and transparency. This means our security monitoring systems must be able to provide detailed visibility into a model's decision-making process, all while protecting sensitive data. It's a delicate balance between providing transparency and maintaining confidentiality.

Looking ahead, the intersection of zero trust and AI security is evolving at a rapid pace. To stay ahead of emerging threats, platform engineers need to keep an eye on a few key trends. First, there is confidential computing. This technology uses hardware-based security capabilities to enable secure model training and inference on untrusted infrastructure. It's a game changer for data protection and workload isolation, as it fundamentally changes how we approach these problems. Next, we have federated learning. This approach reduces data exposure by keeping training data decentralized, but it creates a new challenge: managing the security of complex distributed systems that require sophisticated security orchestration. We are also seeing the rise of AI-powered security. Emerging tools can help automate zero trust policy enforcement and incident response for AI platforms, which is crucial for handling the scale and complexity of modern systems and helps reduce the operational burden on our teams. Finally, a more long-term consideration is post-quantum cryptography. While we may be years away from practical quantum threats, it is never too early for platform engineers to begin considering post-quantum implementations to ensure the long-term security of AI systems.

In summary, implementing zero trust for AI platforms represents a fundamental shift in how organizations approach infrastructure security. We can't rely on old security models; we must build resiliency into the infrastructure from the ground up, and the unique characteristics of AI workloads require these specialized approaches. Success requires close collaboration between platform engineering, security, and AI development teams. Security cannot be an afterthought; it must be designed into the platform architecture from the ground up. The investment pays off in many ways: you gain better visibility into your AI operations, your compliance posture improves, and you achieve more reliable system performance. The key is to focus on incremental improvement rather than wholesale transformation; each security control you add provides immediate value while building towards a more comprehensive architecture. Looking ahead, we'll see emerging technologies like confidential computing and federated learning fundamentally change how we think about data protection, and AI-powered security tools will help us automate policy enforcement at scale. Okay, thank you for listening. If you have any questions, you can reach out to me on LinkedIn or through email.
...

Sudheer Obbu

Senior Lead Software Engineer @ JPMorgan Chase

Sudheer Obbu's LinkedIn account


