Scaling Multi-cluster Kubernetes at Teleport Cloud

Abstract

Resilient multi-cluster Kubernetes architecture is challenging to nail down, especially for applications that require complex and coordinated deployment strategies.

At Teleport Cloud, we manage dedicated instances of Teleport using Kubernetes clusters (EKS) in six regions that maintain more than 100k reverse tunnels.

This talk covers our strategy for configuration management, multi-cluster coordination, and zero-downtime deployment with ultra-long-lived TCP connections on Kubernetes.

Summary

Teleport is an open source infrastructure access platform that makes use of reverse tunnels. For Teleport Cloud, we run a dedicated instance of Teleport for every customer. Maintaining highly available, ultra-long-lived reverse tunnels to the resources that people need access to is a top priority.
How do we upgrade Teleport auth and proxy across six regions in a coordinated way? The first thing we tried was GitOps. In the end, we built something that solves this problem in a really narrow way. It's called sync controller.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everyone. Today I'm going to talk about how we scaled multi-cluster Kubernetes at Teleport Cloud. So what is Teleport? If you don't know, Teleport is an open source infrastructure access platform that makes use of reverse tunnels, like for example SSH tunnels, to provide audited access to infrastructure for developers, for CI/CD systems, and for other use cases like Internet of Things. Teleport provides access to Kubernetes, SSH, database (like Postgres, for example), web application, and Windows desktop resources. It understands the different protocols behind those resource types so that it can provide application-level auditing, right? It understands what Kubernetes commands look like and what SSH commands look like, so it can audit the connections to those different resources that you might be running.
If you look at Teleport's architecture, the key components that run on the server side are the Teleport proxy, which provides connectivity, like for instance managing reverse tunnels, and the Teleport auth server, which does authentication and authorization, manages things like user roles, and provides the back-end logic. So imagine what a client connection to a resource that's managed by Teleport would look like. Say you're connecting to a Kubernetes cluster: you might run kubectl get pods, and you're not connected directly to a Kubernetes cluster, but instead you're pointed at a Teleport proxy through an mTLS connection. The Teleport proxy also has a reverse tunnel coming from that Kubernetes cluster, egressing out through a firewall, and that connection is coming from a Teleport agent that's running in a pod in that cluster. So your kubectl get pods command goes through the proxy, through the reverse tunnel, and reaches the cluster on the other end to provide access. Teleport Cloud is a hosted version of Teleport that's offered by Teleport as a company.
And for Teleport Cloud, we run a dedicated instance of Teleport for every customer. That means we're running many deployments of Teleport, where we're operating over 10,000 pods and over 100,000 reverse tunnels at a given time. This happens across six regions to provide global availability. Anytime any of those tunnels are disrupted, that would disrupt access to the underlying resources, so it's really, really important that we provide a very stable network stack for this platform. Another important detail here is that those proxies, running in separate Kubernetes clusters in each region, have peered connections. So a client might connect to a proxy in one region, a peered gRPC connection is made from that proxy to a proxy in the region that the resource they're trying to access is running in, and then the connection goes through the reverse tunnel to that resource.
For a long time we ran Teleport Cloud on Gravity. We've recently switched to EKS, where we have an EKS cluster in each of the regions that we provide connectivity for. If you break down the major needs that we had when putting together this platform, I'll just focus on three really important things. First, ingress: maintaining highly available, ultra-long-lived reverse tunnels to the resources that people need access to is a top thing. Second, it was really important that we be able to do coordinated rollouts across these regional clusters. We often want to update auth servers first when we do an upgrade of Teleport, and then upgrade the proxies, because of how the auth servers cache proxy heartbeats. Or we might not want to roll the proxies in every single region at the same time. We might need more coordinated deployment strategies for some different customer use cases. And finally, container networking. All these clusters have to speak to each other. The proxies need to peer.
Everything needs connectivity to auth servers, but auth servers don't run in every region. So just focusing on ingress first, I'll go through a journey of the things we tried for all of these and how we arrived at the architecture that we have today. First we tried anycast, actually, and you might imagine how this went. We really liked anycast. We tried out anycast through AWS Global Accelerator. We liked it because it provided really stable IP addresses. DNS never had to change. We could give everybody in the world a single IP address for Teleport Cloud, and they'd be able to really reliably connect to that. But at the same time, as you might imagine, anycast didn't quite provide stable enough routing. I think there have been a lot of success stories with anycast and streaming video, long-lived TCP connections in that context. But we found that even with a lot of logic that would resume reverse tunnels quickly if they dropped, it really just wasn't stable enough to provide really consistent, really highly available connectivity. So anycast didn't work out for us.
We also tried open source Nginx as an ingress routing layer in each of the clusters. We really liked Nginx because it supported ALPN routing, so Nginx will let you route connections from different clients using ALPN. ALPN is kind of similar to SNI. It's some metadata in a TLS connection that'll let you make decisions about how to route that connection without terminating it, without decrypting it. So the ingress stack could separate client connections from agent connections and route to the correct proxies using that metadata. Nginx did a great job with that, but it had no in-process configuration reloading. Anytime we'd need to make a change to these routes, we'd need to start a new instance of Nginx, and eventually the memory usage got really high. It just didn't work well for how often we needed to change the ingress configuration.
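As a rough illustration of the ALPN routing idea described above: the client lists the protocol names it wants to speak in the TLS ClientHello, and a router can read that list and pick a backend without decrypting the rest of the connection. The sketch below shows only the routing decision. The protocol names and backend names are made up for the example and are not Teleport's, Nginx's, or Envoy's actual configuration.

```python
from typing import Mapping, Sequence

# Hypothetical ALPN protocol names and backend pool names, for illustration only.
ROUTES: Mapping[str, str] = {
    "example-reversetunnel": "agent-proxies",
    "example-client": "client-proxies",
}
DEFAULT_BACKEND = "client-proxies"


def pick_backend(
    offered_alpn: Sequence[str],
    routes: Mapping[str, str] = ROUTES,
    default: str = DEFAULT_BACKEND,
) -> str:
    """Return the backend for the first offered ALPN protocol that has a route.

    offered_alpn is the protocol list a client advertised in its ClientHello.
    Falls back to the default backend when nothing matches.
    """
    for protocol in offered_alpn:
        if protocol in routes:
            return routes[protocol]
    return default
```

With a table like this, agent tunnels and client connections can share one listener and still land on different sets of pods, which is the property the ingress stack relies on.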
And so we didn't go with Nginx either. Now I'll talk about what we did do. In our ingress stack, at the DNS level, we used Route 53 latency records with the external-dns operator. The external-dns operator would publish latency records to Route 53 based on the NLBs, the separate network load balancers that sit in each region. We use network load balancers instead of anycast because they're kind of stateless. So when there are changes to them underneath, they don't necessarily drop connections, and they let us provide really stable reverse tunnels. Instead of Nginx, we went with Envoy proxy, which also supports ALPN routing, which is great. We configure it using Gateway API, which is a fairly new set of APIs for ingress in Kubernetes that aims to replace the Ingress resources that have existed in the past. And we used our own fork of Envoy Gateway that had some changes. For one, it supports using annotations to do ALPN routing on TLS routes, but we also made some stabilizing changes there that we're working to get upstream. And finally, we have a little hack for doing zero-downtime deployments using minReadySeconds on top of deployments that I'll talk about.
So, to break down what this looks like, imagine you're trying to connect to your Kubernetes cluster using kubectl, and the Teleport agent is running in the Kubernetes cluster, which is behind a firewall. The relevant pieces of Teleport you'll see here are the proxy pods running in each region. In this case, we'll just talk about three regions, where the client is connecting through proxies in us-east-1 to proxies in ap-southeast-1, where the Kubernetes cluster is running. All the proxy pods have streaming connections to the auth server, so they get up-to-date information on roles and the information needed to make authorization decisions.
All the proxies cache this information so that they can really quickly make these decisions without having to reach out to auth for all of it. So we have the reverse tunnel that's coming from the Kubernetes cluster, and we have the client connections, these kubectl commands, that get routed to the closest proxy, through a peered connection between the proxies, and then finally back up to the Kubernetes cluster. If we take auth out of this for a second and just look at the connectivity piece, what this looks like is we have the external-dns controller running in each cluster, and we have NLBs sitting in front of the nodes that run the proxy pods in each of the clusters. external-dns is reading the IPs of those NLBs and reporting them back to Route 53, so that Route 53 can return the IP of the closest proxy when your end user, client, or agent needs to connect, so that it gets the closest region.
So how do we do a zero-downtime deploy on top of this architecture? This gets a little complicated because we have these reverse tunnels that are open all the time, connected to proxies, but we need to update those proxies, right? If you look at the architecture we had before, we have Envoy running in each cluster, configured by Envoy Gateway, routing to different proxy pods for different customers. When we want to do a deploy, we use the ALPN routing feature in Envoy to control whether connections land on the old pods or the new pods. An interesting thing we did here is that we used a feature of Kubernetes deployments that you might not know about called minReadySeconds. This is kind of similar to a termination grace period, where you keep multiple sets of pods running at a time, old generation and new generation running at the same time.
But with minReadySeconds, you can keep both generations responding to new network requests at the same time for a certain period, until the new set of proxy pods is not just considered ready but, after minReadySeconds has passed, is considered fully available, and then the old pods will terminate. We use this period, when both generations of pods are ready at the same time, to allow new tunnels to come from the existing agents and hit the newly spun-up pods, until all the new tunnels are fully established, before we make the ingress changes that start routing connections through the new set of pods. After that happens, and after old connections drain off through the old network pathway, we shut down the old set of proxy pods. The way we do this flip from one set of pods to another is that we have a custom controller that changes the labels on a service to point from the old generation to the new generation of pods. So that's basically an overview of our ingress stack for how we route these ultra-long-lived tunnels. We route client connections through those long-lived tunnels for Kubernetes clusters in six regions.
The next thing I'll talk about is deployment. I talked about how we do a zero-downtime upgrade for proxies in an individual cluster, but how do we upgrade Teleport auth and proxy across six regions in a coordinated way? We tried a couple of different options here. The first thing we tried was GitOps. When you're doing deployment to Kubernetes clusters, what's the first thing you think of? Right, use Flux CD or a similar tool to deploy from a Git repo. We had our own CRD that's reconciled by a controller we have called tenant controller, and the CRD is of course called Tenant.
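To make the minReadySeconds timing from the zero-downtime rollout concrete, here is a small sketch of the availability rule: Kubernetes treats a pod as available once it has been ready for at least minReadySeconds. The phase names, and the idea of flipping the service only once every new pod is available, are simplifying assumptions for illustration. The custom controller mentioned in the talk is not shown here and may rely on other signals, such as whether tunnels have been re-established.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Pod:
    name: str
    ready_since: Optional[float]  # time the pod became Ready, None if not Ready yet


def is_available(pod: Pod, now: float, min_ready_seconds: float) -> bool:
    """A pod counts as available once it has been Ready for min_ready_seconds."""
    if pod.ready_since is None:
        return False
    return now - pod.ready_since >= min_ready_seconds


def rollout_phase(new_pods: Sequence[Pod], now: float, min_ready_seconds: float) -> str:
    """Classify the new generation of proxy pods (phase names are hypothetical)."""
    if not new_pods or any(p.ready_since is None for p in new_pods):
        return "starting"   # some new pods are not Ready yet
    if all(is_available(p, now, min_ready_seconds) for p in new_pods):
        return "available"  # new generation is available; old pods can drain
    return "overlap"        # old and new pods are both Ready; agents open new tunnels
```

During the "overlap" phase both generations accept connections, which is the window the talk uses to let existing agents establish tunnels to the new pods before the service is pointed at them.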
We thought about storing that configuration in Git for each customer and then applying that to all of the clusters. A disadvantage of this approach, number one, was that we really wanted all the data to stay in Postgres. We didn't want to start writing a bunch of customer data into a Git repo and having to manage that Git repo over time. Another thing that didn't work about the GitOps approach with Flux was that Flux is very unidirectional. We didn't just want information synced from a Git repo into clusters. We wanted to pull information out of those clusters in order to be able to progress the deploy to more steps. So auth servers finished deploying, now we want to update proxies. Or maybe we want to update proxies, but we don't want to update every region at the same time. So that approach didn't work. We didn't go with GitOps.
We tried cross-cluster reconcilers after that, where we had a controller running in each region, but we didn't have a CRD in each of those regions. They all reconciled against a custom resource, that Tenant custom resource, in a namespace in the management cluster. So in this proposal we have a namespace for each customer in every cluster, but the custom resource only lives in the management cluster. This didn't work very well either. If we'd gone with this, we would have created a big single point of failure for the whole platform in that one cluster. We didn't like that. We wanted everything to be able to operate without the management cluster. There were also some difficulties: we'd have to have all of the regional clusters write to the same status field of that shared Tenant CR, which leads to conflicts and other problems. So what we arrived at was neither of those things. We really liked KubeFed.
We thought KubeFed was maybe on the right track with exposing APIs from one cluster into another cluster, so you could have a controller that understands how to operate custom resources that it's not reconciling, which are reconciled somewhere else. We really like that model, but the project isn't active anymore, and it seemed like it'd be a big risk to pick up KubeFed if there really wasn't a lot of activity there. So we built something that just solves this problem in a really narrow way. It's called sync controller. We just open sourced it a couple of days ago, so you can check it out if you want to.
The way sync controller works is it let us build this architecture where we have a management cluster that is driven by that Tenant custom resource inside of a customer namespace. But the controller for the Tenant resource is just responsible for creating additional Teleport deployment resources, one for each region that Teleport needs to operate in for that customer. Each of those custom resources is then synced to an individual instance of the same resource that lives in the corresponding region. So if the customer is in three regions, the Tenant CR might create resources for us-west-2, us-east-1, and ap-southeast-1, and each of those is then picked up by the sync controller running in that region, which creates a namespace, creates the resource, and then the resource is reconciled there.
To dig into what that looks like in more detail: this diagram here isn't specific to Teleport, it's just generally how you use sync controller. I'll show you a Teleport-specific version in a second. In this instance you have sync controller running regionally. It watches the spec of the resource in the management cluster. It copies any changes that it sees to the spec of that resource into the instance of the resource in the regional cluster, where it's then reconciled.
The reconciler then writes the latest status of that resource in the regional cluster, and sync controller is also watching that regional status and copying it back to the management cluster. So from both the regional cluster's perspective and the management cluster's perspective, you have the same resource, but the management cluster can create and operate a selection of these resources from the outside, whereas the regional cluster can do the actual reconciliation of the resource and create the necessary pods.
So here's the Teleport-specific version of what this architecture looks like. We have sync controller running in the regional cluster. It's copying the Teleport deployment spec from the management cluster into the regional cluster, where it's being reconciled by the Teleport controller that creates the auth server and proxy pods, and then any changes to the status are sent back to the management cluster. In the management cluster we have a tenant controller that, for one, does any centralized work. It creates the DynamoDB tables or Athena resources for audit logging, all of those things that are shared. Then it creates a selection of Teleport deployment resources that configure each region. The nice thing is that it can react to changes in those: it can watch for status changes in us-west-2, know when auth is finished deploying, and tell regions that they can update their proxies, for example. So it can make decisions based on the changing state in the different clusters. This architecture worked really, really well for us. It's really nice because we can lose the entire management cluster, and all of our regional clusters are still operable. We'd have to change the Teleport deployments manually in the different regions, but they can keep reconciling forever in that state.
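The two-way flow described above (spec copied down from the management cluster, status copied back up) can be sketched with plain dictionaries standing in for custom resources. This is not the sync controller's actual API, which is a Go reconciler. The function names and the status fields authVersion and authReady are hypothetical and exist only to show how a management-side controller could gate proxy updates on auth status.

```python
import copy
from typing import Any, Dict

Resource = Dict[str, Any]


def sync_once(management: Resource, regional: Resource) -> None:
    """Copy spec down to the regional copy and status back up to management."""
    regional["spec"] = copy.deepcopy(management.get("spec", {}))
    management["status"] = copy.deepcopy(regional.get("status", {}))


def proxies_may_update(auth_status: Resource, target_version: str) -> bool:
    """Hypothetical gate: roll proxies only after auth reports the target version as ready."""
    return (
        auth_status.get("authVersion") == target_version
        and auth_status.get("authReady") is True
    )
```

Because each side writes only its own half (management owns spec, the region owns status), there is no shared status field for several clusters to fight over, and a region keeps reconciling its last known spec if the management cluster disappears.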
On top of this, I mentioned earlier that we wanted to store customer data in Postgres, right? So we built the configuration storage into our customer portal. When customers sign up, or when employees want to manage customer information (through Teleport, obviously), all the data is stored in Postgres, and the data that needs to live on the cluster as well is stored as a Tenant custom resource in JSONB in Postgres. Whenever there's a change to that data in Postgres, we have a sync service that reads that change and sends the changed version to the management cluster. A really cool thing we did here is we took the OpenAPI schema validations that the cluster uses for that CRD, and we also apply them to validate requests to change customer data at the portal level, so we don't end up with a custom resource that wouldn't apply to the cluster getting stored in the database. That was surprisingly easy to do using the open source tooling for OpenAPI available on GitHub. So that covers the way we do our coordinated rollouts across clusters.
The next big important piece of all this is how we did container networking. How do we let proxies talk to auth servers? How do we let proxies peer with each other across regions for this massive multi-cluster deployment? What we found was that Cilium global services worked really, really well. We tried some different deployment architectures with Cilium. We found that having a dedicated etcd performed the best. It let us deal with a lot of pod churn. So whenever we do a big update for many tenants at the same time and we have a lot of new pods spinning up and shutting down, or when we update Cilium itself and there's a lot of things that get reconfigured, a dedicated etcd ended up performing the best. There are some other ways of deploying Cilium global services that you can look into, but that's what worked for us.
To break down what that looks like, I'll go back to this diagram that you saw earlier, where you have the Teleport deployments in each region, and focus for a second on the services that get created by the Teleport controller there. So we have an auth service and we have a proxy service, but we don't have auth running in every region. So whenever proxies need to speak to auth in us-east-1, those connections get redirected to a service that has the same name, which Cilium automatically provides forwarding connectivity to, from us-east-1 to us-west-2. That lets us run our auth servers in multiple availability zones in one region, but not have to run them in every region, and proxies in all regions have cached access to those auth server pods. In some cases, we don't have proxies available in a region, and in those cases our custom controllers can also create a global service that redirects proxy connectivity to the closest region, which they can calculate.
And so that is our journey through the Teleport Cloud architecture, covering ingress, deployment, and container networking. Just as a reminder, a lot of the stuff you saw today is open source. Sync controller is Apache 2 licensed. We actually just open sourced it a couple of days ago. Please check it out. It's not something you deploy to a Kubernetes cluster. It's a tool you can use to build. It's a reconciler you can import into your own controller manager that will let you create a management plane using your own custom resources. We also maintain a fork of Envoy Gateway that supports ALPN routing and has a couple of other changes we made, and a lot of them we actually got upstream, to stabilize some parts of how Envoy Gateway works. And finally, of course, Teleport is Apache 2 licensed. You can check that out as well, deploy the open source version, and see what you think. And last but not least, I want to give a huge thanks to everybody on the Teleport Cloud backend team.
Carson, David Tobias and Bert Bernard. You can see the parts they worked on here. This was a huge team effort, and it wasn't just this team, right, to get this platform together. And that's all I've got. Thanks, everybody.

Conf42 Kube Native 2023 - Online

September 28 2023

Stephen Levine

Software Engineer @ Teleport
