Conf42 Kube Native 2025 - Online

- premiere 5PM GMT

Scaling Kafka on Kubernetes: Cloud-Native Streaming for 100,000+ Messages Per Second

Abstract

Transform your Kafka deployments with proven Kubernetes patterns that deliver million+ msg/sec throughput! Learn StatefulSet optimization, pod autoscaling, and GitOps workflows that slash deployment time while maintaining sub-50ms latency. Real-world strategies for production streaming at scale.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Today I'm going to share practical insights on scaling Kafka on Kubernetes to handle extreme streaming workloads, specifically how to architect and optimize for over a hundred thousand messages per second in production environments. I'm an independent researcher focusing on cloud native architecture and distributed systems, and this presentation draws from real-world implementation experience with high-throughput streaming platforms.

Let's start with the problem space. Modern containerized applications are generating unprecedented data volumes. We are talking millions of messages per second that need real-time processing. The challenge isn't just about raw volume. We need Kubernetes-native streaming platforms that can scale dynamically with workload demands while running stateful streaming workloads at low latency and high throughput. This intersection of stateful applications and container infrastructure creates unique architectural challenges that we will address today.

So why Kafka on Kubernetes in the first place? First, containerization: Kubernetes gives us automatic scaling, self-healing, and sophisticated resource management for Kafka clusters. These capabilities are built into the platform rather than requiring custom tooling. Second, cloud-native benefits: we get seamless integration with cloud services, persistent storage solutions, and service mesh architectures. This integration reduces operational complexity significantly. Third, dynamic scaling: horizontal pod autoscaling lets our Kafka deployment adapt to traffic patterns automatically while maintaining streaming performance. This elasticity is critical for cost optimization and for handling variable workloads.

The foundation of any successful Kafka on Kubernetes deployment rests on three pillars. StatefulSet configuration is essential: we deploy Kafka brokers as StatefulSets, ensuring ordered deployment, stable network identities, and persistent storage attachment. This is critical because Kafka brokers are inherently stateful. Our persistent volume strategy uses optimized storage classes, specifically SSD-backed volumes with high IOPS, to meet Kafka's demanding log segment performance requirements and low-latency access patterns. Finally, service mesh integration provides secure inter-pod communication with traffic management and comprehensive observability. This gives us encrypted communication and detailed metrics without modifying the application code.

Resource allocation requires careful tuning. Here's what works in production. CPU requests should be two to four cores per broker; this provides enough processing power for high-throughput workloads without over-provisioning. Memory allocation is critical: we typically configure an 8 to 16 GB heap per broker, with the exact size depending on topic count, partition count, and message retention policies. Resource limits are equally important to prevent noisy neighbors in multi-tenant clusters. We use the Guaranteed QoS class for critical Kafka workloads, ensuring resources aren't preempted under cluster pressure. Proper resource allocation ensures consistent performance under varying loads while preventing resource contention, which is especially important in shared cluster environments.
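As a concrete illustration of the StatefulSet and resource guidance above, here is a minimal sketch using the official Kubernetes Python client. This is not the speaker's actual manifest: the namespace, image tag, `fast-ssd` storage class name, and exact sizes are illustrative assumptions that simply follow the numbers mentioned in the talk (2-4 cores, 8-16 GB heap, Guaranteed QoS, SSD-backed volumes).

```python
# Sketch: a Kafka broker StatefulSet with Guaranteed QoS and SSD-backed storage.
# All names, sizes, and the image tag are illustrative assumptions.
from kubernetes import client, config


def kafka_broker_statefulset(replicas: int = 3) -> client.V1StatefulSet:
    # Requests == limits places the pods in the Guaranteed QoS class,
    # so brokers are not the first to be evicted under node pressure.
    resources = client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "24Gi"},
        limits={"cpu": "4", "memory": "24Gi"},
    )

    broker = client.V1Container(
        name="kafka",
        image="apache/kafka:3.7.0",  # illustrative image tag
        resources=resources,
        # Heap in the 8-16 GB range from the talk; the rest of the container
        # memory is left for the OS page cache Kafka depends on.
        env=[client.V1EnvVar(name="KAFKA_HEAP_OPTS", value="-Xms12g -Xmx12g")],
        ports=[client.V1ContainerPort(container_port=9092)],
        volume_mounts=[client.V1VolumeMount(name="data", mount_path="/var/lib/kafka")],
    )

    # volumeClaimTemplates give each broker its own PVC with a stable identity.
    data_pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="fast-ssd",  # hypothetical SSD-backed storage class
            resources=client.V1ResourceRequirements(requests={"storage": "1Ti"}),
        ),
    )

    return client.V1StatefulSet(
        api_version="apps/v1",
        kind="StatefulSet",
        metadata=client.V1ObjectMeta(name="kafka-broker", namespace="kafka"),
        spec=client.V1StatefulSetSpec(
            service_name="kafka-headless",  # headless Service gives each broker stable DNS
            replicas=replicas,
            selector=client.V1LabelSelector(match_labels={"app": "kafka"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "kafka"}),
                spec=client.V1PodSpec(containers=[broker]),
            ),
            volume_claim_templates=[data_pvc],
        ),
    )


if __name__ == "__main__":
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    client.AppsV1Api().create_namespaced_stateful_set(
        namespace="kafka", body=kafka_broker_statefulset()
    )
```

Because requests equal limits, the broker pods land in the Guaranteed QoS class, and the volume claim templates give each broker an SSD-backed volume that follows it across rescheduling.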
Storage is where many Kafka on Kubernetes deployments fail, so let me walk through our three-part strategy. First, storage classes: we use SSD-based storage classes with high IOPS, specifically for log segments. Kafka's sequential write patterns benefit enormously from fast storage, and low-latency access patterns are essential for consumer performance. Second, volume configuration: persistent volume claims must be sized according to your retention policies and replication factors. Undersizing here leads to frequent disk pressure issues and broker instability. Third, performance tuning: file system optimization and mount options dramatically impact throughput. For example, using XFS with specific mount options can improve write performance by 30 to 40% compared to default configurations.

Network performance is an often overlooked bottleneck. We address it through three mechanisms. Network segmentation uses isolated network policies for Kafka clusters. This ensures security without performance degradation, a common issue when network policies are too restrictive. Load balancing configuration is optimized for streaming workloads with session affinity, which matters because Kafka clients benefit from sticky connections to specific brokers. Bandwidth optimization involves network performance tuning for high-throughput message processing, including TCP buffer tuning, network plugin selection, and sometimes dedicated network interfaces for Kafka traffic.

Autoscaling stateful applications like Kafka requires a nuanced approach, and custom metrics are key. Standard CPU and memory metrics don't tell the full story, so we implement HPA based on Kafka-specific metrics like consumer lag and partition throughput. These metrics directly indicate when more capacity is needed. Our scaling policies use algorithms that account for the stateful nature of Kafka brokers: you can't just terminate brokers instantly. Partition leadership needs to be transferred and replicas need to be reassigned, so the scaling policy must respect these operational requirements. Traffic adaptation also ensures dynamic adjustment to varying message volumes while maintaining partition balance across the cluster, because unbalanced partitions can create hotspots that negate the benefits of scaling.
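The talk doesn't show how the consumer-lag signal is collected, so here is one possible sketch using the kafka-python library to compute total lag for a consumer group. The bootstrap address and group name are hypothetical; treat this as the shape of the metric, not the speaker's implementation.

```python
# Rough sketch: total consumer-group lag, the kind of Kafka-specific signal
# the talk recommends feeding an HPA instead of CPU or memory.
# Assumes the kafka-python package; addresses and names are illustrative.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "kafka-headless.kafka.svc.cluster.local:9092"  # hypothetical in-cluster address
GROUP_ID = "payments-processor"                            # hypothetical consumer group


def total_consumer_lag(group_id: str = GROUP_ID) -> int:
    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
    # Committed offsets for every partition the group consumes.
    committed = admin.list_consumer_group_offsets(group_id)

    # A throwaway consumer (no group) just to look up log-end offsets.
    consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
    end_offsets = consumer.end_offsets(list(committed.keys()))

    lag = 0
    for tp, offset_meta in committed.items():
        if offset_meta.offset >= 0:  # a negative offset means nothing committed yet
            lag += max(0, end_offsets[tp] - offset_meta.offset)

    consumer.close()
    admin.close()
    return lag


if __name__ == "__main__":
    print(f"total lag for {GROUP_ID}: {total_consumer_lag()}")
```

In practice you would expose this number continuously, for example as a Prometheus gauge, and let the HPA consume it through a custom or external metrics adapter rather than scaling on CPU alone.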
Enterprise deployments typically require multiple Kafka clusters. We manage this through Helm charts, which provide standardized deployments across environments with configurable parameters. This consistency reduces operational burden and enables rapid deployment in new regions. Global distribution means cross-region replication for disaster recovery and reduced latency: users in different geographic regions can consume from local clusters while data consistency is maintained through MirrorMaker, which provides reliable cross-cluster data synchronization with exactly-once semantics. This is critical for maintaining data integrity across regions.

Operational excellence requires treating infrastructure as code, and our GitOps workflow has three components. Infrastructure as code means all Kafka cluster configurations are managed through git repositories with version control; every change is tracked, reviewed, and audited. Automated deployments use CI/CD pipelines that trigger cluster updates based on configuration changes, eliminating manual deployment steps and the errors that come with them. Rollback capabilities are built in: we implement safe deployment practices with automated rollbacks on configuration errors. If a change causes cluster instability, we can automatically revert within minutes rather than hours.

You can't manage what you can't measure, so our monitoring stack has three layers. Prometheus integration exports JMX metrics for comprehensive cluster monitoring; we track broker health, partition metrics, consumer group lag, and resource utilization. Grafana dashboards provide real-time visualization of throughput, latency, and resource allocation, so operators can quickly identify bottlenecks and trending issues. Alerting rules implement proactive notification for performance degradation and capacity thresholds. We alert before users experience issues, not after; this includes alerts on consumer lag spikes, under-replicated partitions, and disk space thresholds.

Now let's talk numbers. In production deployments using these patterns, we consistently achieve a hundred thousand messages per second of sustained throughput with optimized configurations. This isn't burst traffic; this is continuous processing under production load. P99 latency stays under five milliseconds even at very high load. Low latency is critical for real-time applications, and achieving sub-five-millisecond P99 at this scale demonstrates the effectiveness of our optimization strategies. We also see 99.9% availability with zero-downtime deployments: we can perform rolling upgrades, scale clusters, and handle node failures without dropping messages or causing downtime. These metrics demonstrate the effectiveness of cloud-native Kafka deployments handling demanding streaming workloads in production environments. This isn't theoretical; these are real production numbers.
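Throughput at that level depends on client tuning as much as on the cluster. The following is a hedged sketch of producer-side settings that typically help sustain high message rates with kafka-python; the values, address, and topic name are illustrative starting points, not the speaker's production configuration.

```python
# Sketch: producer batching and compression settings that favor throughput.
# Values are illustrative starting points; tune against your own message
# sizes and latency budget. Assumes the kafka-python package.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-headless.kafka.svc.cluster.local:9092",  # hypothetical address
    acks="all",                        # durability: wait for in-sync replicas
    linger_ms=10,                      # a small wait lets the producer fill larger batches
    batch_size=256 * 1024,             # 256 KiB batches amortize per-request overhead
    compression_type="lz4",            # needs the `lz4` package; cheap CPU, big bandwidth win
    buffer_memory=128 * 1024 * 1024,   # headroom to absorb short broker slowdowns
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

start = time.time()
for i in range(100_000):
    # Keyed messages keep per-key ordering while spreading load across partitions.
    producer.send("events", key=str(i % 1024).encode(), value={"seq": i, "ts": time.time()})

producer.flush()  # block until everything buffered has been acknowledged
elapsed = time.time() - start
print(f"sent 100k messages in {elapsed:.2f}s ({100_000 / elapsed:,.0f} msg/s from one client)")
```

Batching (linger time plus a larger batch size) and compression are usually the biggest levers; acks="all" preserves durability, and whether a single client reaches the quoted rates depends heavily on message size, partition count, and network.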
Disaster recovery isn't optional for critical streaming infrastructure, so we implement a layered approach. Kubernetes operators provide automated backup, restoration, and failover capabilities; these operators understand Kafka's operational requirements and can perform complex recovery procedures automatically. Cross-region backups ensure we can recover from regional failures; we maintain topic configurations, consumer group offsets, and ACLs in separate regions. Automated recovery procedures are tested regularly through chaos engineering practices: we deliberately fail components to ensure our recovery mechanisms work as expected. Data replication and consistency guarantees ensure that failing over to a backup cluster doesn't result in data loss or duplication, which requires careful configuration of replication factors and acknowledgement settings. Together, these capabilities ensure business continuity for critical streaming applications.

As we wrap up, I want to emphasize three key points. First, strategic planning: proper resource allocation and storage optimization aren't optional, they are critical for sustained high-throughput performance, and many deployments fail because they underestimate these requirements. Second, embrace cloud-native approaches: leverage Kubernetes-native platforms for scaling, monitoring, and operational excellence. Don't fight against the platform; use its capabilities. Third, be production ready from day one: implement comprehensive monitoring, GitOps workflows, and disaster recovery from the beginning. These aren't things you add later; they are foundational requirements.

Building Kafka on Kubernetes at scale requires careful attention to these architectural patterns, but the result is a resilient, scalable streaming platform that can handle massive workloads with operational simplicity. Thank you all for your attention. I'm happy to take questions about any aspect of scaling Kafka on Kubernetes, from region-specific configuration recommendations to troubleshooting common challenges. You can connect with me after the talk if you'd like to discuss your specific use case in more detail.

Shalini Katyayani Koney

Senior Engineering Manager @ Walmart



