Conf42 Kube Native 2025 - Online

- premiere 5PM GMT

Mirror Mirror on the Wall, How Many Kubernetes Environments Are Enough After All?


Abstract

One environment to deploy your app to production. One for staging. One to run your CI tests. One for each developer because you don’t want them to block staging. This talk will show how you can leverage existing envs for dev and CI so you can lower your cloud costs and reduce cluster sprawl.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone. Welcome to Conf42 Kube Native. I'm Arsh, and I work as a Senior DevRel Engineer at MetalBear. In this talk, I'm going to be covering a very important problem, and that is the rising number of Kubernetes clusters in an organization once it decides to adopt Kubernetes. I'm going to show you how, when you as an organization make the decision to move to Kubernetes, you start with a couple of clusters, and then the number of clusters just grows and grows. Don't worry, I'm not just going to be talking about the problem in this talk. I'm also going to give you a solution you can use to reduce the number of Kubernetes clusters in your organization, thereby reducing your cloud costs and making the lives of your DevOps teams and developers a lot easier. So let's get right into it.

First, I want to walk you through how things happen when you as an organization decide to adopt Kubernetes. It starts out with a lot of excitement, right? You have heard good things about Kubernetes, you realize the benefits you will get, and you decide to adopt it for your production workloads. This usually means creating one cluster, which you'll use to deploy your production workloads. Then you realize that you cannot ship changes directly to that production cluster, because sometimes your changes might break it. So you add one or two staging clusters, which help you test changes before you ship them to the production cluster too. At this point things are still manageable, and I feel like we're at an okay point.

Then you realize that deploying to staging actually takes a lot of time, and developers don't want to wait that long just to get validation of whether the changes in a pull request are working or not. At this point, you start spinning up clusters in your CI setup, because you want to run tests and give feedback to developers early on, instead of them having to wait for staging deployments. Also, when one team is deploying to staging, it becomes unusable for the other team, which is another reason why the demand for CI clusters starts growing. At this point, things have started to get a little out of control. You're realizing that your cloud bill is slowly rising, but it's still okay, because you recognize the benefits Kubernetes is giving you and your team.

Then one day you hear a complaint from your developers that even CI is so slow that staging and CI are both inaccessible because of how long they take, and your developers want a faster way to test their changes. At this point, you start thinking about personal environments and personal clusters which mirror your production setup, so that developers can test immediately and you're still able to ship at the speed your business demands. Now you can clearly see your cloud bills rising. Now you're thinking about FinOps and how to optimize cloud costs, because you're hearing pressure from everyone in your organization that you really need to cut down the cloud bill. But the story doesn't stop here, because personal environments also now seem insufficient. Your developers want ephemeral, on-demand environments, so you either resort to provisioning a lot of clusters, or you start provisioning a huge cluster and partitioning it by namespace. But the outcome is the same: you now have way more clusters or namespaces than your DevOps team can manage, and your cloud bill is just enormous.
Anyone who has dealt with personal dev environments on Kubernetes, I'm sure you will relate to this meme, where your cloud bill has reached a point where it's just not sustainable to continue with personal dev environments. It might have been obvious from the stuff we talked about earlier, but let's still take a detailed look at some of the downsides of having multiple Kubernetes clusters. First of all, your cloud bill is just enormous and your costs are rising. Secondly, managing all these environments takes a huge chunk of time and effort from your DevOps team, time and effort which could be better spent on other things if they had that flexibility. It also sometimes means you need to expand the size of your team and hire more people to manage these clusters, which further adds to your costs as an organization. The other downside is that if you decide to do ephemeral environments, these environments still take time to provision. So even when a developer wants to test a code change, or you spin up a cluster in CI, developers end up waiting 15 to 20 minutes each time before they get the feedback and validation that the latest code they have written is working or not. And lastly, at some point you just have too many YAMLs and too many clusters to manage, and you start to wonder why you even went down this path in the first place.

Okay, I know at this point you might be thinking that you're doomed and there is no way out of this. But there actually is. What if you could use your staging environment for your CI and your dev environments? What if you could safely share that environment, instead of having to provision individual environments or more clusters for each developer's or each CI run's needs? This is where our open source project called mirrord comes into the picture. mirrord lets you run your local process in the context of a Kubernetes environment. So while your process is running on the local machine, you actually get all the benefits you would have had if you deployed to staging, without going through the pain of actually deploying to staging. This also means that you don't have to provision more clusters, because your developers can safely run in the context of one staging cluster, instead of requiring their own personal clusters, ephemeral environments, or things like that.

Let's see how this works behind the scenes so that you start to understand what I'm saying here. With mirrord, your application runs locally, but the traffic, the file system, the environment variables, all of that gets mirrored from the Kubernetes cluster. So your local process thinks it's running on the cluster, whereas it's actually still running on your local machine. This way you're able to test in a realistic environment without provisioning more environments, and without having to wait 15 or 20 minutes for your code to get deployed to a staging or dev environment. This diagram here should give you a better idea of how this works. You see that the local process is running on your laptop, but mirrord mirrors incoming and outgoing traffic between your local machine and the cluster. Let's suppose your process needs to talk to some other process or microservice which is deployed on the cluster, but not on your local machine. Any outgoing requests your process makes get mirrored through the target service, which is deployed in the cluster, so other services see those requests as if they were coming from the target service. Similarly, any traffic you send to the target service in the cluster also gets mirrored to your local machine, so you're able to test the behavior of the code you have just written without having to actually deploy it.
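To make that concrete before the demo: with the mirrord CLI, putting your local process in the cluster's context is a single command. A minimal sketch, assuming the demo's deployment name and a Go service normally started with `go run`:

```sh
# Run the local process under mirrord, targeting a deployment in the
# staging cluster; its traffic, environment variables, and file reads
# are mirrored from that deployment's pods.
# (The deployment name is illustrative, taken from the demo app.)
mirrord exec --target deployment/ip-visit-counter -- go run .
```

This is the same functionality you'll see me trigger from the editor extension in the demo.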
If things aren't super clear from this diagram right now, trust me, when we see the demo, everything will connect. Before we go to the demo, I just want to sum up the things mirrord enables for you, so that when you see the demo, you're able to recognize what is happening. With mirrord, incoming traffic to your cluster is mirrored to the locally running process. Similarly, outgoing traffic from the locally running process is routed via the cluster, so it appears as if it's coming from within the cluster. Finally, instead of mirroring requests, you can also choose to steal certain targeted requests: you can set filters so that if a request matches a particular header, that request is stolen and sent only to your local process, instead of being mirrored between the cluster and the locally running process. You're also able to read and write the files in the cluster, in case your application requires that. And at the same time, you can mirror environment variables from the cluster to your local machine, so that the configuration your application runs in remains exactly the same, allowing you to test it in a production-like environment.

Okay, it's almost time for the demo, but let's first look at the architecture of the demo application so that you understand it when we actually see the demo. We will be working with the IP visit counter service, which, when you send it a request, returns the IP the request came from and the number of times a request has been sent from that IP. Behind the scenes, the way it works is that any time a request comes in, it saves that request to a Redis database, then it talks to another service, the IP info service, and gets some information about the IP from that service. And finally, it sends messages to Kafka, which get consumed by the IP visit consumer service, which is used for logging. We won't be seeing the logging part in this demo, but we will be working with the IP visit counter service, making some changes to it and seeing how we're able to test those changes without having to wait for our code to get deployed, and without having to provision a cluster or a separate environment for this. So the setup we will work with looks like this: we assume that this application is deployed to a staging cluster by the DevOps team, the standard staging cluster where you'd usually test, but we use that staging cluster to test locally written code without having to commit the code or go through pipelines and all of that stuff.

Now, when I tell people this, these are some of the responses I get. I'm often met with a sort of disbelief, or just a straight-up "you're lying", but a lot of people have questions: is this safe? Can you actually safely share the cluster? What happens if my process accidentally writes to a sensitive database? I'll answer all of these questions after the demo, so that once you've seen mirrord in action, a lot of things will just start making sense by themselves. Cool, so now let's see the demo. Okay, so I have my application cloned here, and we are in the IP visit counter service, in the main.go file.
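To orient you before we dig in, here's roughly what that service does — a simplified sketch in Go, not the exact demo code (the real app keeps the count in Redis and also calls the IP info service and Kafka, which I've left out here):

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"os"
	"strings"
	"sync"
)

var (
	mu     sync.Mutex
	visits = map[string]int{} // the real demo app keeps this count in Redis
)

// getCount sketches the demo handler. The key detail: /app/response.txt is
// read at request time, and under mirrord that read is served from the
// cluster's filesystem, so the file doesn't need to exist locally.
func getCount(w http.ResponseWriter, r *http.Request) {
	ip, _, _ := net.SplitHostPort(r.RemoteAddr)

	mu.Lock()
	visits[ip]++
	count := visits[ip]
	mu.Unlock()

	data, err := os.ReadFile("/app/response.txt") // exists only in the cluster
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// "remote is fun" comes from the file; the "hi" is hardcoded here.
	fmt.Fprintf(w, "IP: %s, visits: %d, %s hi\n", ip, count, strings.TrimSpace(string(data)))
}

func main() {
	http.HandleFunc("/", getCount)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```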
The application has already been deployed to a Kubernetes cluster, and this is like a staging cluster. So if we get a list of all the deployments on that cluster, you'll see that the IP visit counter service, the consumer service, Kafka, Redis — everything has been deployed on the cluster, and none of it is running on my local machine. This solves the first problem, where developers need to set up their dev environment or wait for an ephemeral dev environment to get provisioned, and all of these things take time to deploy. If you use a staging cluster for development with mirrord, like I've been suggesting, then you skip all of that, because all the dependencies, databases, queues, and whatnot are already deployed on the staging cluster, and you can directly talk to them using this method.

To give you an overview of the application: you can send a request to the endpoint, and it will give a response showing the IP address the request was sent from, the number of times we have sent a request, and some text which says "remote is fun" with a "hi" appended to it. The text "remote is fun" is not hardcoded in the application; it is being read from a file which is on the cluster, and the "hi" is hardcoded in the application code. So if I look here in the code, you'll see that we get this response string and then append a "hi" to it, but the response string is being picked up from a file which is on the cluster, and we will soon see that. If I were to send another request here, the count should get updated to two, like we see here.

Cool, so now we are ready to test mirrord in the context of this application. The easiest way to try mirrord — I'm in Cursor here, but mirrord has a code editor extension for Cursor, VS Code, Windsurf, and JetBrains IDEs — is to search for the mirrord extension and install it. But if you don't want to use the extension for some reason, we also have a CLI available, which lets you achieve the same functionality you'll see me use here. Once you have the extension installed, you should see it here, and now we want to select an active configuration — the configuration for the service we are going to be working on. I'll walk you through the configuration later on, but since we are working on the IP visit counter service, I'm going to select that configuration. So I go here, select active configuration, and choose IP visit counter. After that, I simply enable the extension by clicking this dot.

Once it's enabled, what we're going to do is set a breakpoint. We saw earlier that when we send a request to the endpoint of the cluster, nothing happens locally. Now we're going to use mirrord and see that when we send a request to the endpoint, that request does get mirrored to our locally running process. So let's add a breakpoint. I've added a breakpoint in the main function, where we load some config, and then I go to the run and debug section of my editor and click run and debug. You see it says that the mirrord binary was found, and shortly it should ask us for a target on the cluster. This target is the target whose requests we want mirrored to the locally running process. Basically, we want to establish a sort of connection between our IP visit counter service, which will be running locally on the machine, and the IP visit counter service's deployment.
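For reference, the configuration the extension picks up is just a JSON file. A minimal sketch of what a mirror-mode configuration for this service could look like — the field names follow mirrord's config schema as I understand it, so check the mirrord docs for the authoritative version:

```json
{
  "target": {
    "path": "deployment/ip-visit-counter"
  },
  "feature": {
    "network": {
      "incoming": "mirror",
      "outgoing": true
    },
    "fs": "read",
    "env": true
  }
}
```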
So I come here and I'll choose the IP visit counter deployment, and after that, mirrord basically establishes that connection for us. Okay, now things are running, and we can go to the debug console and see that the connection is established. And now, when we send a request to the same endpoint, we'll see that our breakpoint here in the editor gets hit. So I'm sending this request, and we should get a response as expected, but here you'll see that our breakpoint gets hit. Once the breakpoint is hit, we can actually step into this function, and here you see that we load some environment variables. Once these environment variables are loaded, the thing I want to show you is the final config we return. So I'm going to continue until we hit this breakpoint, and here, in the final configuration, you see that we load all of these environment variables, and they get loaded from the cluster — I have not set any of these locally. And here you see that we read a response file, which is at /app/response.txt. This file does not exist on my local machine; it exists on the cluster. This is the file which has the "remote is fun" text I was telling you about earlier. So if we return from this function, go here, and step down here, you'll see that we are reading this file — the file is /app/response.txt — and from here we load the file content into the response string, which is empty right now, but if we step here, it gets populated, and the response string is now "remote is fun". So you saw how our locally running process is actually able to communicate with the cluster, read that file, and see its contents, without us having to deploy this particular code — we have not made any changes to it — and without having to go through pipelines or set up our own environment. We are able to leverage the environment of the existing staging cluster.

So here we saw how we can mirror requests and basically test out our code. Assume you had made some change and you wanted to test it: you could just mirror the requests, and those requests would get forwarded to your locally running process so you could test it. But that's not all. With mirrord, you can even test the response of your code. Here you saw that the response we were receiving was being sent by the counterpart of the service, which is already in the cluster. But if we want our locally running code to respond to requests, we can do that as well, with the steal mode in mirrord. The way to enable steal mode is by editing your mirrord config JSON file, which is what I was talking about earlier. Here our mode was set to mirror, and I'm going to change it to steal. I don't want to steal all requests, because one of the main benefits of mirrord is that it allows multiple developers to share the same Kubernetes cluster for development. So what I can do here is add an HTTP filter with my name, and now all the requests which have this filter set will get stolen by mirrord to my local machine, and all other requests will remain untouched. This is really important, because it ensures that other people, other developers, or even AI agents can continue using that staging cluster, and I don't start stealing all the requests to my locally running process, thereby breaking the cluster for everyone else.
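Concretely, the edit is small. A sketch of the same configuration switched to steal mode with a header filter — the header name and value here are arbitrary, whatever identifies you; again, check the docs for the exact schema:

```json
{
  "target": {
    "path": "deployment/ip-visit-counter"
  },
  "feature": {
    "network": {
      "incoming": {
        "mode": "steal",
        "http_filter": {
          "header_filter": "x-conf42-user: arsh"
        }
      }
    },
    "fs": "read",
    "env": true
  }
}
```

With this in place, only requests sent with that header (for example, `curl -H "x-conf42-user: arsh" <endpoint>`) are stolen to the local process, while plain requests keep hitting the in-cluster service untouched.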
This way, I'm able to work independently while sharing the same staging cluster for development as everybody else. Once we have this filter set, I'm also going to make a small code change. We saw the text "hi" earlier, right? I'm going to change this to "hi people from Conf42", and now this is part of our response. This is really important, because we want to test whether mirrord allows our locally running process to respond to requests, thereby allowing us to test our locally running code without having to deploy it to the cluster or provision another cluster where it gets deployed. With that said, I've also set a breakpoint in this GetCount function, and now I'm going to run and debug again with mirrord. I'll choose the IP visit counter service again. I forgot to remove the earlier breakpoint, so that first breakpoint is hit immediately, but we can continue. And now, when we go back here and send a request to this endpoint, nothing happens, and that is as expected. Our remote process is still replying and giving us a response — we see the text, which does not show the change we made — and that's expected, because we are not using the HTTP filter we set. But if I send a request which has that HTTP filter, we'll see that the breakpoint gets hit. So you see the breakpoint here gets hit, and we're able to check the same IP, which is the IP of our machine. We're also not getting a response yet, because the request has been paused by our locally running code. Now, if I let go of this request and go back to my terminal, I should see a response which shows our updated string. So you see, here we got our response, and it says "remote is fun" plus "hi people from Conf42". This is very important for understanding how mirrord allows you to test your code changes, including the response of your code or anything else, in an isolated manner, but in a shared environment. You share the same cluster as everyone else, but you're able to safely and independently test your code.

Okay, so that was a demo which showed some features of mirrord, but I also want to touch upon other features which make sharing a cluster really easy with mirrord. The first thing, which we saw, is that you can set HTTP filters, which ensure that each request is handled once and is routed to the right process — the right developer who's working on that particular thing. If a request hits the remote service in the cluster, mirrord will look for the right HTTP filter and then forward it to the right developer. This way, developers don't end up interfering with each other's work and are able to work independently. Secondly, if your application consumes messages from a queue service like Kafka or SQS, mirrord supports queue splitting, which enables concurrent use. Queue splitting is a feature in mirrord which allows multiple developers to safely consume messages from the same queue without disrupting the environment for other people. I won't be going into the details of this, but you can check out our documentation, or our YouTube channel, where we have created short videos explaining each feature. The third feature mirrord has is something called policies. mirrord policies help prevent forbidden actions. So if you have a sensitive database which you don't want your developers accidentally writing to, you can set up a mirrord policy which prevents that from happening.
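If you're running the mirrord operator, policies are Kubernetes custom resources. Here's a sketch of one kind of policy, blocking stealing without an HTTP filter so no single developer can grab all of a service's traffic — the apiVersion and field names are my best recollection of the CRD, so treat this as illustrative and check the mirrord docs:

```yaml
# Illustrative MirrordPolicy: forbid stealing a service's traffic
# without an HTTP filter in place.
apiVersion: policies.mirrord.metalbear.co/v1alpha
kind: MirrordPolicy
metadata:
  name: no-unfiltered-steal
spec:
  block:
    - steal-without-filter
```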
So this gives you the peace of mind that your shared staging environment will not be broken by any single developer's changes while they are using that environment for development. Lastly, if you do not want to use some staging resources — let's suppose a database which is in staging — for any reason, you can choose to run those services locally. mirrord gives you the flexibility of using some services from the cluster and running others on your local machine, like databases you might not want to touch because you're testing out a migration, or you just don't want to affect the staging database. These are some of the features which make sharing a staging cluster for development using mirrord completely safe and possible. And this is how you can avoid creating ephemeral dev environments or personal environments, or even spinning up clusters for your CI process, because you can use the mirrord CLI in CI itself.

This brings me to the conclusion of my talk, and I just want to sum up everything we have seen so far. The gist is that once you as an organization adopt Kubernetes, if you are not careful, the number of Kubernetes clusters you provision just gets out of hand, because you're provisioning clusters for CI, you're provisioning clusters for devs, you're provisioning on-demand clusters for feature branches, and all of that stuff. mirrord helps you avoid that by letting multiple developers and your CI use the same staging cluster for development, instead of provisioning clusters for each and every one. This gives your developers and your CI tests a production-like environment to run in, while it also saves your DevOps team the time and hassle of provisioning and maintaining multiple clusters which look like production. With this, you have one production cluster, and you can have another staging cluster which looks exactly like production, and that staging cluster is the only thing the DevOps or platform engineering team has to take care of — making sure it looks like production — because all your developers and your CI can then just use that cluster for development and testing purposes. It's much better than having multiple clusters which you have to manage, which increase your cloud bill, and which make for a really bad developer experience because of how slow they are and how long they take to provision. When you go with this approach, you have one cluster which is already ready; nobody has to wait for things to provision, and they can instantly access it and test out their code changes, as we saw in the demo. So to sum up: with mirrord, your cloud costs are lower, you ship way faster, and most importantly, your developers are happier.

Thank you for watching this talk. I hope you found it useful and that it helps you reduce the number of Kubernetes clusters you have in your organization for development and testing purposes. If you want to try it out, you can check out our website, metalbear.com/mirrord, to learn more, or if you want to get hands-on and try mirrord for yourself, you can check out the open source project on GitHub — the link is on the slide. And if you have any questions about the talk, if you have any feedback, or if you want to set up a call to understand how mirrord works, my calendar is always available to talk about mirrord. You can send me an email at arsh@metalbear.com. Thank you for watching, and I hope you all have a great day.
...

Arsh Sharma

Senior DevRel Engineer @ MetalBear

Arsh Sharma's LinkedIn account


