Conf42 Cloud Native 2022 - Online

An Introduction to Service Mesh in Action


Abstract

Have you heard the term service mesh but have no idea what it means? Are you architecting, developing, or responsible for running distributed applications? This talk will make you smarter on the concepts around service mesh technology. We will also get a little nerdy and explore the awesome platform capabilities you get from an Istio-based service mesh running containerized applications and services.

Why? As modern applications move toward microservices-based architectures, the importance of a platform to back both the development and operational work grows. Development teams struggle with building, debugging, and connecting services properly. And application operations teams face increasing challenges with securing hybrid deployments, scaling bottlenecks, recovering from failure, and gathering metrics.

This session will feature an introduction to the technology and also give a demo showcasing key capabilities a service mesh platform brings to connect, secure, and observe microservice-based applications.

Summary

  • Jason Dudash is a chief architect at Red Hat. He focuses on modern application development and cloud technologies. Today he gives an introduction to service mesh and, at the end, demonstrates a lot of the key concepts.
  • Distributing your software across multiple systems is advantageous, but it also brings a lot of challenges. The challenges are inherent to distributing software capability across networks. These challenges have existed for a long time, but they're even worse today because we're transitioning monolithic systems into microservice architectures.
  • Kiali is the main dashboard for the service mesh. The demo application lets you create boards and add little items to them. In this particular example, some problems have been introduced so we can explore the observability features in action.
  • In Grafana, we can see that the user profile v2 service is causing long delays. We can also see that same information via trace spans in the distributed tracing dashboard. You can see how quickly you can fix a routing problem by just changing the dynamic rules behind the scenes.
  • Another traffic management capability showcased is circuit breaking. If failures are detected at a certain threshold, the mesh trips the circuit and prevents further calls from being made. Security-wise, you can also have services verify JSON Web Tokens for trust.

Transcript

This transcript was autogenerated.
Hi, my name is Jason Dudash. I'm a chief architect at Red Hat, and I focus on modern application development and cloud technologies. Today I'm going to give an introduction to service mesh, and I'm going to focus on the how, the what, and the why of service mesh. And I apologize in advance, this is going to be a little bit of a fire hose. I've got a lot of information I want to get through in a short amount of time. At the end, I'll also be demonstrating a lot of the key concepts that I talk about in the beginning slides.

So really to set the stage, what we're talking about here is distributed computing, distributed systems at the most basic level. And distributed systems are great. There's a lot of capability you can get from distributing your software across multiple geographies, across multiple systems, and you can meet all these non-functional requirements, right? Those things that systems engineers like to call the "-ilities": the scalability, the supportability, the reliability. And this isn't a new concept. Distributed systems have been around since Ethernet was invented in the late 70s, but they've just now become more ubiquitous, and we see things happening at much larger scales than ever before.

Distributing your software across multiple systems is really advantageous, but it also brings a lot of challenges, and those challenges are inherent to distributing software capability across networks. These things were identified 25-plus years ago as the fallacies of distributed computing, and they impact the way we develop, deploy, and manage our software systems. They're things that novice developers don't think about; maybe experienced developers are writing some additional code to deal with the challenges they're facing. But for the most part, people aren't thinking about the reliability of the network, the latency they're going to experience in production, or the bandwidth they'll have, and even security concerns are often overlooked.

Those sorts of challenges have existed for a long time, but they're even worse, and even more impactful, today because we're trending toward moving out of our data centers into cloud environments, and we're transitioning monolithic systems into microservice architectures. Those microservice architectures bring us all this extra agility, but they also mean we're distributing things at a much larger scale than ever before. All these independently scalable, single-purpose services that compose your overall application mean lots and lots of little network connections back and forth, and chains of network connections between all these services.

You've probably seen this before if you've been building microservice architectures; I've definitely seen it in my systems and in the customer systems I work with. Once you start building these things, everything looks good in development. Everything actually looks good in a lot of cases in test; you fix a lot of bugs in QA and you're like, cool, let's ship this thing. Everything's good to go. But once you get into production, things become less predictable, especially over time and under production-level loads. Things really don't perform the way you expected them to, and scaling up isn't the solution that fixes a lot of the problems you have. In fact, when we do fix problems, we do it with workarounds.
You might be thinking, hey, if you haven't done microservices already, there are a lot of companies doing it and they're very successful. And you're right, because they've found ways to work around and deal with those types of challenges. Historically, what we've seen is that those challenges are addressed by boilerplate code and third-party libraries. Netflix is probably best known for creating some of these things, like Eureka and Zuul. These frameworks get bundled into every microservice and provide solutions to deal with the things we're talking about. But that's not really ideal. It can reduce agility, and we're talking about adding extra work to developers' plates to load these libraries in and actually manage those dependency chains, right? So imagine how much more challenging that gets if you're not only developing Java-based microservices, but you're also using Go and you're also using Node.js. Now you've got this problem replicated across the different tools and the different programming languages you need to support.

So what we're talking about here today is a common approach to dealing with those challenges by moving the responsibility to the platform, so you can address it at the infrastructure layer and developers don't have to reinvent the wheel for each new service they're developing. And that lets us apply policy consistently across an entire application, across an entire series of microservices.

I think a really good analogy that helps explain what a service mesh does is roads and traffic control. If your company or your organization is a city, then the roads that connect people's homes and businesses and places of work, those are the networks, right? If you live in a really small town with just a few roads, you might not need a whole lot of traffic control. But once you get to a city of a certain size, you're probably now in a position where you can't trust everyone to obey the speed limit and do the right thing, or even that they'll do something the same way as each other if there are no traffic signs and no guidance for them. So what do we do in a city? We put in place traffic control. We have police officers, we have speed limit signs, we have bike lanes, we have stoplights and walk signs and don't-walk signs. And we control what's going on in that city. The same thing should be true of organizations that are deploying microservice-based applications across Kubernetes environments. You need to be able to assert control over how traffic moves between those services, and the service mesh is the control plane for asserting that control. You could probably take the analogy even further if you wanted to, and talk about how observability is important, because cities also have traffic cameras. If you can see what's going on in traffic, you can identify bottlenecks in the system, and you can audit and figure out how to improve those things. Again, the service mesh has observability capabilities as well, and we'll get into that when I get to the demo.

But under the hood it's pretty straightforward. I'm going to give a high-level architectural overview of the service mesh. It starts with a Kubernetes cluster like OpenShift, and your service mesh is part of that platform. There are two big concepts, a data plane and a control plane, and I'll explain both of those.
So the data plane is essentially a mediation layer that controls all the network communication between the microservices. That's its role in life, and it does that transparently. One of the really cool things about how this works is that the mesh deploys a sidecar container, a Kubernetes architecture pattern, that is colocated with your application. So your applications are in this data plane; they're all talking to each other, but they're doing that through this sidecar proxy called Envoy. That's the open source project. It's a really fast and dynamically configurable proxy. It provides an API so that you don't have to reload anything; it just accepts new configurations. And so we're able to program these Envoy sidecars to do all the policy enforcement that we've identified, and we define that policy in a control plane layer. So your policy is part of this control plane. The reason this is really, really important is because imagine you have hundreds of microservices and hundreds of proxies. You wouldn't want to have to configure each one of those individually. The control plane lets you define your policy, and it applies it across all of your proxies for you. The separation of the control and data planes lets you make changes to your mesh without having to change any of your application source code. And honestly, that's probably the coolest part about all this: it's truly dynamic, and you're solving your challenges without having to write new code, without having to rebuild your services. And once your services are part of the mesh, you don't even have to redeploy your containers into Kubernetes to apply policy changes.

So with that introduction, that firehose of information, let's take a look at it in action and see how some of this stuff works. Okay, let's dig into some observability capabilities of the mesh. I've got a simple microservices application here with a single sign-on, a user interface, a profile service, and a couple of databases. Altogether those microservices make up this web application, which lets you create boards and add little items to boards. So I can come over here and add something, say "red hat," to the list. And in this particular example I've introduced some problems so that we can explore the observability features in action.

There are three main observability tools I want to showcase. The first one is called Kiali, and it's like the main dashboard for the service mesh. I can come into this graph view here in Kiali and see everything that's happening: the ingress into my mesh, the services that are running, and what workloads are behind them. You can see the user interface, called app-ui, running the latest version of its container; a board service, which provides an API to edit and store data into a MongoDB database; and then our user profile service, which actually has two different versions backing it, version one and version two. I can also see that same information in this applications view. I can click on this app UI and it gives me the little graph overview, but I can also click on traffic and see all the inbound sources of data and the outbound destinations, the protocol types, and some metrics and success rates on all these things. Right now I'm running a couple of for loops to simulate some user load here.
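As a quick aside on the mechanics described above: joining a workload to the data plane typically just means opting its pods into sidecar injection. Here is a minimal sketch of what that might look like; the workload name, image, and port are hypothetical placeholders, and the exact mechanism varies by version (OpenShift Service Mesh uses the pod annotation shown here, while upstream Istio also supports a namespace label):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: userprofile-v1          # hypothetical workload name
  labels:
    app: userprofile
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: userprofile
      version: v1
  template:
    metadata:
      labels:
        app: userprofile
        version: v1
      annotations:
        sidecar.istio.io/inject: "true"   # ask the mesh to inject the Envoy proxy into this pod
    spec:
      containers:
      - name: userprofile
        image: quay.io/example/userprofile:v1   # placeholder image
        ports:
        - containerPort: 8080

Once the pod comes up with the proxy injected, all of its traffic flows through Envoy, which is what makes it visible to the tooling below.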
And there are some problems with this load, so we're going to dig into that. Over here in the Grafana dashboard, I've opened up the service viewpoint, and it shows me data and metrics about what's happening with my services. Right now I've got this user profile service selected; I could select one of the other services if I wanted to see data on that. If I scroll down and see what's going on, I can right away see that these graphs show incoming requests being satisfied very slowly in some cases. We're seeing 20 seconds for this user profile v2 service to respond, and only three to five milliseconds for version one. So that's a problem.

And I can see that same information via trace spans in this distributed tracing dashboard. If I select the services for the user interface and the operation call to the profile, and I click find traces, we'll see these drastically different bubbles, with these calls taking a lot longer than those calls down at the bottom. This tracing tool comes in really handy when you've got long microservice call chains that provide a return path to display some information on a GUI or something like that. When things go wrong, it lets you dig down into the details and see exactly where the problem is. In this case, the problem is at the end of the chain, so it just looks like the bars are full; but if something happened in the middle, it would be really obvious and visualized very nicely. You can get a lot of information from this trace span, all the HTTP header information, and you can see, like we saw in Grafana, that this user profile v2 is causing us these long delays. And that looks like this on the app: if I click profile, it's like, oh man, it's chugging along, but nothing's happening. It just keeps ticking and ticking, and then finally it comes up.

So those observability pieces have told us something's going wrong. What we're going to do to fix that is run some commands to apply some policy. The first thing I'm going to do is create some destination rules and virtual services, and I can use OpenShift (oc) or kubectl type commands to see what those things are. I can see I've got destination rules, and I've got virtual services now. And if I want to go look at what that looks like in Kiali, things are a little bit different. Now in Kiali, I can go to this Istio config view and see all those different configuration items we created. We'll notice there's actually a pretty nice capability where, if you've got problems (in this case, I've got an intentional problem in a destination rule), it's going to give you an error telling you something's not right. But let's go back over and show how we can apply some policy. We're going to change a destination. We saw that we have this virtual service in Kiali that was, if you remember, splitting traffic between version one and version two, and we can see that. And if I want to flip it, I can apply this configuration (oops, sorry, which I just applied) to flip all the traffic back from version two, which was giving us problems, to version one of our service. I'm already throwing all this load in, so it should happen pretty much immediately. And we should see that when I go to the profile now, it should just work, and I'm not getting those long delay problems, right? So that works here. And I'm going to go to Kiali.
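For reference, the traffic flip being applied here corresponds to two Istio resources: a DestinationRule that names the version subsets, and a VirtualService that weights traffic between them. This is a sketch rather than the exact demo files; the host name userprofile and the version labels are assumptions based on what's visible in Kiali:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: userprofile
spec:
  host: userprofile             # the Kubernetes service name (assumed)
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: userprofile
spec:
  hosts:
  - userprofile
  http:
  - route:
    - destination:
        host: userprofile
        subset: v1
      weight: 100               # send everything back to the known-good v1
    - destination:
        host: userprofile
        subset: v2
      weight: 0                 # drain the slow v2

Adjusting those weights (say, 90/10) is also exactly how the canary rollout in the next step works; no application code changes or redeploys are involved.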
We'll see that traffic has shifted over the last minute or so, and it'll eventually get up to 100%. So right away you can see how quickly you can fix a problem with routing by just changing the dynamic rules behind the scenes.

Another thing I could do is deploy a third version of this service, so let's do that right now. I'm going to add a v3. Let's say we fixed that problem we had in v2 and now want to go ahead and patch it. We'll run a command to create the user profile v3, and we'll take a check here and make sure it comes up and runs. It's almost there; it's trying to find itself a stable state. Cool, it looks like it's running. That's good. Now what we want to do is route traffic to it. But instead of making the same mistake we made last time, let's do a canary deployment, which is an advanced deployment technique that sends just some of the traffic there. So we'll say, let's keep 90% of the traffic going to the one we know works, but shift 10% over to this newer service. And let's curl some traffic. Actually, I'm already curling it, so it's getting traffic already. Now if I go back over to Kiali, I can see this version three has shown up, and it should start getting traffic pretty quickly now that we've applied that rule. So let's see. Yes, there we go, it's starting to shift up to 10% of that traffic. It's just averaging the last minute of data, which is why it took a little while to catch up, but the traffic was already starting to show up there as soon as I hit enter on that command in the command line. So let's go back over to our board and start popping in: that's v3, that's v1, v1, v1, v1, v1, v1, v3, v1. Cool, it's loading fast. We fixed that 20-to-30-second delay bug, and everything looks good. So now we could go ahead and shift it to a 50/50 split if we wanted to and see that, and then eventually we would just say, hey, let's put everything over to version three, which would look like that.

Another traffic management capability I want to showcase is called circuit breaking. If we go back to the profile page, we can see that right now I'm balancing traffic between version three of the profile service and version one, and you can see they're different colors. The circuit breaking concept is essentially similar to a circuit breaker in your house: if the current is going to overload the circuit and cause problems, the breaker interrupts the flow. In this case, if the mesh has detected failures happening at a certain threshold, we'll trip that circuit and prevent further calls from being made. We'll essentially eject that workload from the serviceable pool. It looks kind of like this: we'll create a destination rule, and down here we'll define an outlier detection section, where we say how many consecutive errors to tolerate and define some other properties and policy details. If I start sending load again and go over to the Kiali dashboard, I can see (give it a second, there we go) we're getting load, and you can see this 50/50 balance is happening. You can also see it's visualizing that there's a circuit breaker here ("has circuit breaker"). And if I wreak some havoc over in my cluster and just delete that pod, the version three pod, we'll just destroy it. So where's that going to be? This. So we killed version three.
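The outlier detection policy sketched in the demo looks roughly like the following. Again a sketch, not the demo's exact file: the host name and thresholds are illustrative, and the exact field names depend on the Istio version in your mesh (newer releases use consecutive5xxErrors where older ones used consecutiveErrors):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: userprofile-circuit-breaker
spec:
  host: userprofile             # assumed service name, as before
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1      # keep the queue small so overload surfaces quickly
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 1           # trip after a single 5xx from a host
      interval: 1s                      # how often hosts are evaluated
      baseEjectionTime: 30s             # how long a tripped host stays ejected
      maxEjectionPercent: 100           # allow ejecting every unhealthy host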
If I flip back over to here, we'll notice that version three will stop working in a second. And you can see here that version three started getting errors, and now the traffic is being sent back over to version one because that circuit breaker has tripped.

Okay, so I've flipped over to the container platform dashboard to showcase that, while a lot of the things we've been doing are with the command line, there are also graphical ways to do them. Right now we're looking at the control plane overview for the installed service mesh, and we can see that while we turned on all the observability capability, we didn't turn on security capabilities for the control plane or the data plane. We want to do that now, because we've decided that our policy should take into account the need to encrypt all the service-to-service communication. So if I go over here, I can visualize and showcase with a simple little curl command that, hey, anybody can jump on, run a container, and get the data out of these microservices. We're not really enforcing a strong identity in a nice secure way. So let's fix that by adding a peer authentication policy. That's pretty straightforward; it just looks like this, and we'll do a create command. Now I need to set some destination rules to tell the rest of the services to communicate using mutual TLS, so I'll do that right now. That's going to create the destination rules. And as a quick test, if I run that curl command again, we'll see that it failed, because it wasn't a known identity; it wasn't a member of the mesh that was allowed to make these sorts of calls. And if I flip back over to the web console, I could have easily done that in a similar way just by turning the checkboxes on for the whole mesh. So that would be another way to do it.

The last thing I want to showcase, security-wise, is a quick look at some of the additional resources you can configure beyond just the mutual TLS. You can also have the services verify JSON Web Tokens for trust, and that would be through a request authentication policy. And in addition to authentication, authorization can be done via authorization policies. Those both look a little bit like that.

So with that demo, that's everything we had time to demo today. We've just scratched the surface, so there's a whole lot more information if you want to go deeper. I recommend you go to developers.redhat.com and check out the service mesh topic; we've got tutorials and articles and videos you can check out, so it's a great resource for you. And if you want to ask me questions, reach out to me directly. Feel free to scan this QR code; it'll take you to my social accounts. Red Hat links are below too, so you can find out more about Red Hat. Thank you for watching.
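For reference, the security steps in this part of the demo map to a few Istio resources. A minimal sketch, assuming a hypothetical demo namespace and the same userprofile service as before; the SSO issuer and JWKS URLs are placeholders, not the ones from the talk:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: demo
spec:
  mtls:
    mode: STRICT                # reject plain-text service-to-service traffic
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: userprofile-mtls
  namespace: demo
spec:
  host: userprofile
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL        # clients present their sidecar-issued certificates
---
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: userprofile-jwt
  namespace: demo
spec:
  selector:
    matchLabels:
      app: userprofile
  jwtRules:
  - issuer: "https://sso.example.com/realms/demo"        # placeholder issuer
    jwksUri: "https://sso.example.com/realms/demo/certs" # placeholder JWKS endpoint

An AuthorizationPolicy would then layer on top of these to control which identities can call which services.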

Jason Dudash

Chief Architect @ Red Hat



