Conf42 Cloud Native 2021 - Online

Containing an Elephant: How we took HBase and HDFS from private data centers into Kubernetes and Public Cloud


Abstract

Operating a distributed system like HBase/Hadoop FS at petabyte scale took years to master in our private data centers. This talk describes our dramatic shift towards running a mission-critical stateful application on Kubernetes in the public cloud, why we did it, and the challenges we had to overcome.

Salesforce runs a very large footprint of HBase and HDFS clusters in our data centers, with multiple petabytes of data and billions of queries per day over thousands of machines. After more than a decade of running our own data centers, we pivoted towards the public cloud for its scalability and availability. As part of this foray, we made a bold decision to move our HBase clusters from staid bare metal hosts to the dynamic and immutable world of containers and Kubernetes. This move brought with it a number of challenges which are likely to find echoes in other mature stateful applications adapting to the public cloud and Kubernetes. The challenges include:

  1. Limitations in Kubernetes while deploying large-scale stateful applications
  2. Failures in HBase/HDFS as DNS records keep changing in Kubernetes
  3. Resilience of HBase/HDFS even when a whole availability zone fails in the public cloud
  4. Introducing encryption in communication over untrusted public cloud networks without the application being aware

This talk will go over how we overcame these challenges and the benefits we are beginning to see from these efforts.

Summary

  • Salesforce has been running HBase in its own private data centers. When it moved some of its clusters into the public cloud, it chose Kubernetes to manage them; Kubernetes' elastic management of pods was one of the most important reasons.
  • The difference between stateful and stateless applications: a typical stateless application is basically a set of HTTP servers running a website, which Kubernetes can create as pods. Here are some of the challenges we faced once we started using it for stateful workloads.
  • Using StatefulSets we were able to deploy Hadoop and HBase. We found many of their features around managing compute, storage, failover, et cetera, very useful, but there were challenges with the rolling upgrade process used to upgrade the software. Fortunately, Kubernetes is also very extensible.
  • Another area of problems we experienced was DNS. As pods move from one host to another, their IP addresses keep changing. We came up with a solution where we put a load balancer in front of each of our metadata-server pods. But there's also no guarantee that your clients are running in the same Kubernetes cluster.
  • Typically you deploy software in a certain region, and you can also spread it across different availability zones. In Kubernetes there has been significant effort to support this kind of deployment, and it's really useful for resilience in general.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Dhiraj Hegde and I'm from Salesforce. Today I'll be talking about how we manage HBase in the public cloud. HBase is a distributed key-value store that is horizontally scalable, and it runs on top of the Hadoop file system. For many years, Salesforce has been running HBase in its own private data centers. It's been running a very large footprint of clusters with thousands of machines, petabytes of data storage and billions of queries per day. But it had been using very traditional mechanisms of managing these clusters on bare metal hosts, using open source tools like Puppet and Ambari. When we started moving some of our clusters into the public cloud, we decided to take a very different approach, using Kubernetes to manage these clusters. We'll explain in this talk why we chose Kubernetes, what challenges we ran into with our choice, and how we overcame those challenges.

So why did we pick Kubernetes? There are a number of reasons. One of the main problems that we ran into with our old deployment mechanism was that it was in-place on the host, meaning you would go into a host that already had software installed and try to modify it. The problem with that is that when you are trying to deploy on thousands of hosts, the installation could fail on some of those hosts, leaving them in an uncertain state, with some binaries or some configs present and others missing. And if you happened to miss these failures, those hosts would remain in this inconsistent state for a very long time. The other thing we have noticed is there's a temptation, when you're dealing with emergency issues, maybe a site issue, for people to just log into hosts, modify the configurations locally and completely forget that they did that. And the problem there again is that once the config change is forgotten, that one host looks a little different from many of the other hosts. With containers, a lot of these problems go away because of this thing called immutability. A container is created from an image, and that image already contains all the configuration and all the binaries. As the container comes up, it is an exact replica of what that image has. And if you even attempt to make a change after it's running, the next time the container is restarted, whether from a crash or a deliberate restart, it would once again start from the image, so whatever changes you made locally would be totally lost. So in that kind of immutable environment, it is pretty easy to make sure that over a period of time the images are all consistent, no matter what people try to do to them or what failures happen.

The second reason why we felt Kubernetes and containerization was a good thing was availability and scalability. Kubernetes is really good about making sure that if you are running your containers on a particular host or set of hosts, and something happens to those hosts, it can monitor and catch those issues and immediately restart those containers on a different host. By the way, just to be clear, in Kubernetes containers are managed by a construct called pods, but in this talk we are going to treat the terms containers and pods as one and the same to keep it simple. They are not exactly the same, but we are not going to get into it, and for the purposes of this talk it doesn't really matter as much.
So coming back to the question of availability: when pods fail on a particular host, Kubernetes makes sure that if it can find another healthy host out there, it'll move those pods to that other host. Similarly, when you need to scale up your application, it's very easy in Kubernetes to specify a higher number of pods to meet the requirements of that particular traffic. And once the traffic spike is gone, you can equally quickly reduce the number of pods that run to serve the traffic. So this ability to scale up and down is very valuable to us, especially when you are moving to the cloud. One of the big claims of cloud is that you get this elasticity, so how can you take advantage of it? Using Kubernetes and its elastic management of pods is one great way of achieving it.

The third reason is actually one of the most important reasons why we went with Kubernetes. We had this desire, or goal, to make sure that whatever we built would carry over to other clouds. We started, by the way, with AWS, which is Amazon's cloud offering, but our plan was to build something there that would apply to other clouds, because it is very likely that we would be running our software in other clouds. Examples of other clouds are Azure from Microsoft or GCP from Google. So we wanted to be able to run our software in all those different environments. But the problem is that when you're building your software deployment processes for one cloud, you are so intimately tied to its APIs and the way it manages compute, storage and network that whatever you build there doesn't naturally apply to the other clouds. Fortunately, with Kubernetes things turn upside down. Kubernetes is actually very opinionated: it specifies exactly how compute should be managed, exactly how storage should be managed, and how network should be managed. At that point, it becomes incumbent upon the cloud providers to make sure that they manage storage, network and compute in the way Kubernetes expects them to be managed. So when you build your software deployment processes around Kubernetes on any one of those cloud providers, you automatically get the same deployment processes working in all the other clouds, because it enforces that sort of behavior. So for those three reasons, we found Kubernetes to be very interesting to us.

Okay, now we'll get into some of the challenges that we had with Kubernetes once we started using it. To understand them a little better, you need to know the difference between stateful and stateless applications. Here we show you a typical stateless application, basically HTTP servers that run a website. Each one of these server instances has nothing that is unique in it; they all serve the same content. If you ever make a request to them that changes the content, it usually goes to some back-end database that is shared across all those instances, or maybe even a shared file system across all those instances. So essentially each one of those running servers (in the Kubernetes world, these would be pods) is stateless. And when a client is trying to access the service, it typically goes to a load balancer, as you can see here, and the load balancer then forwards those requests to any one of those HTTP servers in the back end. Now the interesting thing to note here is that the client only needs to know the hostname and IP address of the load balancer.
It doesn't even need to know the hostnames of the HTTP servers running behind it, because they are just getting traffic forwarded to them by the load balancer. Kubernetes usually manages these kinds of environments through a manifest, which is really a document describing how many instances of these HTTP servers to run as pods. It can also describe the configuration of the load balancer, which is called a service in Kubernetes, but again it is specified using a document called a manifest. And with that you can nicely set up all of this. In the early years, this is what Kubernetes was really known for: the ability to very quickly spin up a set of stateless applications and provide some sort of service.

But now look at something like Hadoop or HBase, which is a very stateful application; it's a database. Typically the way it behaves is that you've got a client there, and when it needs to access data, for reading, for querying, for modification, for deletion and so on, it needs access to the data servers, which you can see lined up at the bottom. And each one of those data servers has very different data in it. They typically do not all hold the same data, because then every data server would have to have a huge amount of storage. Instead you break the data up into pieces and spread it across a large number of data servers, and that's how you scale: as you need to store more data, you just add more data servers. On the left hand side you typically have something called a metadata server, which is responsible for knowing where all this data is present. It basically knows the geography of things. So a client, when it is trying to access data, goes to the metadata server with a key, saying that it wants to access this. The metadata server then gives the location of where it can find the data, and the client goes directly to the data server based on that information and accesses the data.

Now in all of this, what you will notice is that the client actually needs to know the identity, the hostname, of all the elements. It needs to know the hostnames of the metadata servers that can provide this information, and once a metadata server provides the hostname of the data server, it needs to deal directly with that particular data server. There's no magical load balancer hiding things from you, which is why the DNS that you see in the top right corner is very important, because it has to have all this information, the hostnames and IP addresses. And by the way, all the clients usually have a cache, because once you discover this information, having to rediscover it for every request is very inefficient. So you have to cache this information, and hopefully the stuff that you cache doesn't change too often, because a change is a bit of a disturbance the client has to deal with, and you try to minimize it. All of this makes it very important that the identities of the servers you're accessing are stable; they do not change very much. So that association between the name of a server and the content, the state it contains, is very, very important to the client for good performance. And this is a very good example of what a stateful application is, and you can see how it's a little different from stateless applications.
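To make that read path concrete, here is a purely illustrative sketch (not actual HBase or HDFS client code) of a client that asks a metadata server where a key lives, caches that hostname, and then talks to the data server directly; all names here are hypothetical:

```python
# Purely illustrative sketch (not HBase/HDFS client code): the client asks a
# metadata server where a key lives, caches that location, and then talks to
# the data server directly on later requests.
class StatefulClient:
    def __init__(self, metadata_server, data_servers):
        self.metadata_server = metadata_server   # hostname of a metadata server
        self.data_servers = data_servers         # hostname -> data-server handle
        self.location_cache = {}                 # key -> data-server hostname

    def read(self, key):
        # Reuse the cached location; re-asking the metadata server on every
        # request would be far too expensive.
        host = self.location_cache.get(key)
        if host is None:
            host = self.lookup_location(key)     # RPC to the metadata server
            self.location_cache[key] = host
        return self.data_servers[host].get(key)  # talk to the data server directly

    def lookup_location(self, key):
        # Placeholder for the metadata-server RPC; stubbed out in this sketch.
        raise NotImplementedError
```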
So the Kubernetes community, when it decided to support use cases like Hadoop, HBase or Cassandra, introduced a feature called StatefulSets. What a StatefulSet provides is the same ability to create pods, but in this case, when it creates pods, it gives them unique names. And the names are kind of easy to guess: they usually begin with the same prefix, in this case "pod", as an example I took; it could even be "http" or "hadoop" or whatever. But then each instance of the pod gets a unique number, like zero, one, two, depending upon how many of these you want. So each pod has a unique name. And you can also see in the top right corner that the DNS is modified to give these pods a proper hostname and IP address. So you've got everything: unique names, hostnames and IP addresses associated with each one of these pods. In addition, since this is a stateful application, you can also define how much storage you want to associate with each one of these pods, the size of it, and also the class, whether you want SSD or HDD. All of this can be specified using a construct called a persistent volume claim. It's basically a claim for storage; it's not the actual storage, you are requesting storage. And this is embedded in each one of these pod definitions where a PV claim is specified. When this is defined, what happens is that the cloud providers who run Kubernetes, like AWS, Azure or GCP, notice this claim and immediately carve out a disk in the cloud which has the size and the class of storage being requested, and then it is made available to Kubernetes. Kubernetes then mounts that disk in each one of these pods so that it becomes locally available storage in each one of those pods. And at that point you've got a unique name which is well defined in DNS, associated with storage, and this is a one-to-one mapping between the two.
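As a rough illustration of what was just described, here is a minimal sketch of a StatefulSet definition, written as a Python dictionary in the shape of the Kubernetes manifest. The names, replica count, storage class and size are illustrative, not our actual configuration:

```python
# Sketch of a StatefulSet manifest, expressed as a Python dict. Pods come up
# as pod-0, pod-1, pod-2 with matching DNS records, and each pod gets its own
# PersistentVolumeClaim (pv-claim-pod-0, ...) that survives pod rescheduling.
stateful_set = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "pod"},
    "spec": {
        "serviceName": "pod",            # headless service that gives each pod a DNS name
        "replicas": 3,
        "selector": {"matchLabels": {"app": "pod"}},
        "template": {
            "metadata": {"labels": {"app": "pod"}},
            "spec": {"containers": [{"name": "server", "image": "example/server:1.0"}]},
        },
        "volumeClaimTemplates": [{       # one claim per pod: the "claim for storage"
            "metadata": {"name": "pv-claim"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": "ssd",                      # SSD vs HDD class
                "resources": {"requests": {"storage": "1Ti"}},  # size of the disk
            },
        }],
    },
}
```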
Now what's interesting to note here is this: let's take the example of any one of these pods, which is running on host A right now. It has a claim and it is accessing PV zero, which is mounted as a disk in it. Let's say for some reason host A has a problem; it goes up in smoke. Kubernetes will notice that and say, okay, this pod is gone, and remove it from its system. You'll notice in the top right corner that the DNS is also modified to remove any DNS entries related to it. You can still see the storage is present, because the claim that created the storage is still present; the pod is gone but the claim is still there. Eventually Kubernetes will find another free host, like host D in this example, and recreate the same pod, so it has the same hostname. Pod zero, the one that got destroyed, is recreated with the same hostname. And because it has the same claim embedded inside of it, the same storage is again associated with it, and even the DNS is updated to have the DNS record. One thing that is different here is that when the DNS record is recreated, it does not get the same IP address. So it has the same hostname, but the IP address had to change. That's just the nature of networking: when you move from one compute unit to another, the IP addresses associated with that compute unit have to change. That's just how the network is managed in Kubernetes. But otherwise you basically achieve something quite interesting, which is that a given hostname is always associated with the same volume, no matter where your compute moves around inside the Kubernetes cluster. That stickiness between the hostname and the volume it uses is a very interesting and useful property. You also notice that the IP addresses can change even when the hostname doesn't, and this is kind of important, because we'll get into some of the issues we have because of this a few minutes from now.

So using StatefulSets we were able to deploy Hadoop and HBase. Going back to the same slide that I showed you a while back, you can see that each one of the data servers is deployed as part of a stateful set, and every one of them has a unique name, like DS1, DS2, DS3 and so on. Similarly the metadata server, which is also stateful, has a unique name and disks associated with it. So we were able to model and deploy our software using StatefulSets pretty well. StatefulSets are managed by a controller called the StatefulSet controller, or STS controller, in Kubernetes. And while we found many of its features around managing compute, storage, failover, et cetera, very useful, we also had some challenges with it.

One area where there was a problem was its rolling upgrade process, where you're trying to upgrade the software. The way it does an upgrade is it starts with the pod with the highest number and goes one by one, upgrading each one of them in strict order all the way down to zero. And while this is a very nice and careful way of upgrading software, it is also very slow. You can imagine, in the world of Hadoop and HBase you've got hundreds of these pods, and each one of them is a heavy server that takes around five minutes to boot up, initialize and set up its security credentials (Kerberos keytabs, et cetera). So going through them one by one would take a very long time and almost make it impractical for us to use such a valuable feature. Fortunately, Kubernetes is also very extensible, so you can go in and modify behavior, or introduce new behavior, by providing your own controllers in certain areas. In this particular case, we were able to build a new controller, which we call the custom controller, which works in coordination with the default StatefulSet controller. The StatefulSet controller continues to create pods, create storage and coordinate the mounting and all of that, whereas the custom controller that we built is in charge of deciding which pods should be deleted next in order to be replaced. So the deletion is the custom controller's job, and the rest of it is the existing StatefulSet controller's job. Once we had this ability, and this is enabled by a setting called the OnDelete update strategy in StatefulSets, if you're interested in looking it up, the custom controller could do batching: it goes after a batch of pods and deletes them first, then the StatefulSet controller notices that these pods are missing and recreates them with the new configuration. The custom controller then moves to the next batch of three, in this case, deletes them, and the StatefulSet controller does the remaining part of bringing up the new ones. So in this manner, by coordinating with the existing behavior, we were able to get batch upgrades enabled in Kubernetes, which was a very big problem when we initially faced it.
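Here is a much simplified sketch of that batching idea (our real controller is more involved). It assumes the StatefulSet uses the OnDelete update strategy, that its pods carry an app label matching the StatefulSet name, and that the kubernetes Python client is available; the names and batch size are illustrative:

```python
# Much simplified sketch of batched rolling upgrades with the OnDelete update
# strategy: the StatefulSet controller still recreates pods; this loop only
# decides which pods to delete next, one batch at a time. Assumes the pods
# carry an app=<statefulset-name> label; names here are illustrative.
import time
from kubernetes import client, config

def upgrade_in_batches(namespace, sts_name, batch_size=3):
    config.load_kube_config()
    apps, core = client.AppsV1Api(), client.CoreV1Api()

    sts = apps.read_namespaced_stateful_set(sts_name, namespace)
    target = sts.status.update_revision          # revision pods should end up on

    while True:
        pods = core.list_namespaced_pod(
            namespace, label_selector="app=" + sts_name).items
        # Pods still on the old revision carry a different controller-revision-hash.
        stale = [p for p in pods
                 if p.metadata.labels.get("controller-revision-hash") != target]
        if not stale:
            return                               # everything is on the new revision
        for pod in stale[:batch_size]:           # delete one batch; the STS controller
            core.delete_namespaced_pod(pod.metadata.name, namespace)  # recreates them
        wait_for_batch_ready(core, namespace, sts_name)

def wait_for_batch_ready(core, namespace, sts_name):
    # Placeholder: in reality, poll pod readiness before moving to the next batch.
    time.sleep(30)
```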
Another limitation we had involves what Kubernetes calls a pod disruption budget, which you define when you deploy your services. This is important to make sure that, whatever operations you do in your cluster, you put a limit on the number of unhealthy pods, or disrupted pods, to use that terminology. In this example, let's say the pod disruption budget is one. What you're saying is that at any given time in your cluster, you will disrupt at most one pod, and not more than that, when doing any of your administrative tasks. Now, the problem here is that if more than one of your pods is unhealthy (in this case, pod three and pod one are in an unhealthy state because of some issue with them) and you are trying to upgrade that particular stateful set, maybe because you want to fix the issue by deploying new code, then, since the upgrade always starts with the highest number, pod five in this case, Kubernetes will prevent that pod from being upgraded, because it would increase the number of unhealthy pods: you have to destroy a healthy pod to create a new one, which would increase the number of disrupted pods as a result. So in this case, again, the custom controller is really useful. What it did was go after the unhealthy pods first while doing upgrades, instead of following the strict ordering: delete the first unhealthy pod and replace it with a healthy one, then go after the next unhealthy pod and replace it with a new one, and then finally go to the healthy pods, which can now be replaced because there are no unhealthy pods left. In this way, we were able to overcome any blockage due to the pod disruption budget and move the rolling upgrade forward.

Another interesting problem we had is kind of unique to stateful applications, I guess, especially things like Zookeeper and many other such services: you have a number of instances, but one of them is elected the leader of the group, and it has certain responsibilities as a result. And to create a leader, you have to go through an election process, so there is some activity and delay involved. Unfortunately, Kubernetes at its level knows nothing about this leader business. So the controller would typically just go after the highest-numbered pod, and if that pod is disrupted and a new one is created, the leader might be re-elected onto one of the older pods. The next upgrade would then hit that leader again, and once again you would have an election. And if you're really unlucky, the third pod would also be from the older set of pods, so you end up disrupting the leader three times in this case. You can imagine that in a real cluster this repeated leader election would be very disruptive. So to avoid this, once again, the custom controller came to the rescue. We built sidecar containers, basically logic that runs inside each one of these pods, which checks to see if the pod is the leader and makes that information available through labels on the pod. The custom controller monitors all these pods to see which one of them has this leader label on it, and it then avoids that particular leader, updates all the other pods first, and finally goes and updates the leader. So you end up disrupting the leader pod only once throughout this process, which was a nice capability we got thanks to this custom controller.
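A small sketch of that ordering decision, assuming the sidecar advertises leadership through a hypothetical leader=true pod label and the pods are objects from the kubernetes Python client (the readiness check is simplified):

```python
# Sketch of the ordering the custom controller applies when picking which pods
# to replace during an upgrade: unhealthy pods first (so the pod disruption
# budget never blocks the rollout), the current leader last (so it is disrupted
# only once). The sidecar is assumed to publish a hypothetical leader=true label.
def is_ready(pod):
    conditions = pod.status.conditions or []
    return any(c.type == "Ready" and c.status == "True" for c in conditions)

def upgrade_order(pods):
    unhealthy = [p for p in pods if not is_ready(p)]
    healthy   = [p for p in pods if is_ready(p)]
    leaders   = [p for p in healthy if (p.metadata.labels or {}).get("leader") == "true"]
    followers = [p for p in healthy if (p.metadata.labels or {}).get("leader") != "true"]
    return unhealthy + followers + leaders   # replace in this order, leader last
```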
Another area of problems that we experienced was around DNS. As you can imagine, Kubernetes is a very dynamic world. As pods move from one host to another, even though they keep the same hostnames, the IP addresses keep changing, and I went over that earlier. This creates a strange problem, because traditional software like HBase, the Hadoop file system, et cetera, was largely developed in an environment where DNS did not change very much. As a result, there were a lot of bugs in this code base where it would resolve a DNS hostname to an IP address and cache that information literally forever. You can imagine that with that kind of code, you would have invalid information in the software pretty quickly. In particular, what we noticed was that when the metadata servers had their IP addresses change, the large number of data servers that had to talk to them were losing their connections to the metadata server as its IP address changed. Now, obviously the fix for this kind of problem is to go into the open source code, find all the places where it is holding on to these addresses, and fix those bugs. But with a large code base like the Hadoop file system and HBase, it's challenging to find all the places where this issue exists. And especially when we had to get our software out, depending on our eyeballing capabilities, or on a testing matrix, to find all these issues seemed a little risky. So what we ended up doing, even as we went about fixing these bugs, was to put a load balancer in front of each one of these pods; it's called a service in Kubernetes. It's actually not a physical load balancer, it's a virtual one which works using network magic; there's really no physical load balancer involved. So we created this virtual load balancer in front of each one of these metadata server instances. What that does is that when you create a load balancer, not only does it get a hostname but also an IP address, and that IP address is very static in nature: it doesn't change as long as you don't delete the load balancer. So even though your pods may be changing their IP addresses, the load balancer does not. When the client is trying to contact the pod, it first goes to the load balancer, and the load balancer forwards the request to the pod. So we sort of recreated the stateless application methodology, at least for the metadata servers, so that we could protect ourselves from IP-address-related issues. And in the meantime we also went about finding all these bugs using various testing mechanisms and eliminating them. But it gave us some breathing room.
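A minimal sketch of such a per-pod service, again written as a Python dictionary in the shape of the manifest. The name and port are illustrative; the selector relies on the pod-name label that the StatefulSet controller adds to every pod it creates:

```python
# Sketch of a per-pod Service (a virtual load balancer) in front of one
# metadata-server pod. Its cluster IP stays fixed even as the backing pod is
# rescheduled and changes IP; the selector matches exactly one StatefulSet pod
# via the pod-name label the StatefulSet controller adds automatically.
metadata_server_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "metadata-server-0"},
    "spec": {
        "selector": {"statefulset.kubernetes.io/pod-name": "metadata-server-0"},
        "ports": [{"port": 8020, "targetPort": 8020}],   # illustrative port
    },
}
```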
Another interesting issue that you have with Kubernetes is how DNS actually works inside it. There's a dedicated DNS server providing all this support for the changing IP addresses. It's called CoreDNS, and it runs inside the Kubernetes cluster. As you create and delete pods of various types, CoreDNS is the one which keeps track of when to create a DNS record and when to delete it. The problem with this approach is that while it all works great on the server side, there's no guarantee that your clients are actually running in the same Kubernetes cluster. Really, in the real world, your client is typically outside of the Kubernetes cluster. It's probably running on some external host or VM, but not necessarily inside your Kubernetes cluster. And that client typically depends on a different DNS server, the global DNS server that is visible across a large number of environments, and not CoreDNS, which is visible only inside the Kubernetes cluster. So to deal with this issue, you have to find a way of getting your records from CoreDNS into the global DNS; otherwise your clients would not know how to contact your services. In our case we used an open source tool called external-dns; most people use it when they're trying to deal with this particular scenario. What external-dns does is transfer the DNS records that are within the Kubernetes cluster into the global DNS server. I've simplified the picture here by showing it moving data from CoreDNS to the global DNS. In reality that's not exactly how it does it, but in effect it has the same impact: it makes sure that those DNS records are available in the global DNS. Once they are available there, the client is able to contact your data servers and communicate with them effectively. Now, one challenge with this approach is that external-dns only runs periodically, every minute or so, so your DNS records are not immediately available in the global DNS. For example, if data server four here is just booting up, it should not go online until it's absolutely certain that its DNS records are available in the global DNS. So we actually had to build logic so that it can validate that the global DNS has indeed got its DNS records. Once that's confirmed, only then does the data server declare itself available for traffic. So this is one of those steps you might have to deal with in the real world when you're using Kubernetes and stateful applications in general.
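Here is a small sketch of that readiness gate, assuming the dnspython package is available and you know the address of a resolver for the global DNS; the resolver IP, record type and timeouts are illustrative:

```python
# Sketch of the startup gate: before declaring itself ready for traffic, a data
# server checks that its own hostname already resolves through the global DNS
# that clients use (not the in-cluster CoreDNS). Uses the dnspython package;
# the resolver address and timeouts are illustrative.
import time
import dns.exception
import dns.resolver

def wait_for_global_dns(hostname, global_dns_ip, timeout_seconds=600):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [global_dns_ip]       # query the global DNS directly
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            resolver.resolve(hostname, "A")      # has external-dns published the record?
            return True                          # yes: safe to start serving traffic
        except dns.exception.DNSException:
            time.sleep(10)                       # not yet; external-dns runs periodically
    return False
```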
Now, finally, I want to talk a little bit about scalability architecture in the public cloud. Typically you deploy software in a certain region. You can deploy it across multiple regions, but if you have high-performance software that needs very low latency, you deploy it in a particular region, which is really a geographical region like US East or US West. Within that region you can also spread your software across different availability zones. Availability zones can mean different things for different cloud providers, but typically each is a separate building, or even a separate campus, fairly close to the others, so that the network latency between the different availability zones is not too high. So you can actually spread your software across them without experiencing too much of a performance issue. I'll be calling availability zones AZs for short here. The goal is typically for you to take a few AZs, in our case we took a three-AZ approach, and make sure that the instances of your software are spread across each one of these AZs. To achieve this, fortunately, there has been significant effort in Kubernetes to make sure that you can support this kind of deployment. It has something called affinity and anti-affinity rules, where you can tell the Kubernetes scheduler to spread the pods across different AZs. The way it works is that the hosts that run in each AZ have a certain label indicating which AZ that host is running in, and you can then tell Kubernetes to make sure that when it deploys these pods, they run on hosts that have different label values as much as possible. Obviously you will have more than three pods, so you're not going to be able to put them all in different AZs, but you make a best effort to balance them equally across the AZs.

Now, that takes care of making sure that your software is running in different AZs, but what about the data inside that software? A good example of this is the Hadoop file system, which keeps three copies of data for high availability reasons. You want to make sure that each one of those copies is in a different AZ, for safety reasons. Fortunately, Hadoop, when it was designed, introduced this concept called rack topology, which comes from traditional data center terminology: you tell the Hadoop file system, in particular its metadata server, what the topology of your servers is, which racks they run in, and Hadoop will make sure that the replicas are distributed across different racks, so that if one whole rack goes down, you still have other racks that can serve the data. We were then able to convince Hadoop, using its script-based interface, that each AZ is a different rack, and thereby Hadoop was able to spread the replicas across different AZs using that mapping. The metadata server itself has multiple instances, which are again spread across different AZs using the same affinity and anti-affinity rules that Kubernetes supports. So what you achieve with all this spreading is that if an entire AZ goes away due to a power outage or a network outage, you still have the software and its data available in the other AZs, still serving traffic in spite of the outage. So it's really useful for resilience in general. That's pretty much all I had for today's talk. Thank you so much for listening.
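As a final reference, here is a minimal sketch of the AZ-as-rack mapping described above. Hadoop calls the script configured in net.topology.script.file.name with one or more data-server addresses and expects one rack path per address on standard output; in practice you would look the zone up from your cloud provider's metadata rather than a static table:

```python
#!/usr/bin/env python3
# Sketch of a Hadoop rack-topology script that maps each data-server address to
# its availability zone, so HDFS places its three replicas in three different
# AZs. Hadoop invokes the script named in net.topology.script.file.name with
# one or more hostnames/IPs as arguments and reads back one rack path per line.
# The address-to-AZ table below is a stubbed, illustrative placeholder.
import sys

ADDRESS_TO_AZ = {
    "10.0.1.11": "us-east-1a",
    "10.0.2.12": "us-east-1b",
    "10.0.3.13": "us-east-1c",
}

def rack_for(address):
    az = ADDRESS_TO_AZ.get(address, "default-rack")
    return "/" + az                  # e.g. /us-east-1a is treated as the "rack"

if __name__ == "__main__":
    for address in sys.argv[1:]:
        print(rack_for(address))
```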

Dhiraj Hegde

Architect @ Salesforce



