Conf42 Cloud Native 2022 - Online

A Developer’s Introduction to Service Mesh

In the ideal development practice, we secure, shape, and observe traffic between services with a single line of code. However, most environments have multiple types of applications running many versions across diverse workloads and platforms, from containers to public cloud to private datacenter. With so many platforms and application frameworks, you cannot use the same code libraries across all services to shape traffic, secure communications, or enhance observability. How can we reduce the development and operational complexity?

In this session, you'll dive into why and how a service mesh can alleviate the management complexity of shaping, securing, and observing traffic across multiple platforms and environments. First, I'll provide a short introduction to the session's setup, which uses HashiCorp Consul and Envoy proxy on Kubernetes. Then, you will learn how to implement and debug traffic shaping and certificate management in the mesh. Finally, you will configure tracing and metrics collection for your service mesh application and examine the telemetry in Prometheus and Jaeger. We'll compare the experience of using a service mesh to various programming language implementations and discuss how to extend the mesh across different workloads.


  • Service meshes rely on something called proxies. It's not just about securing and encrypting the communications between services. There are a lot of pieces to how applications connect to each other. This is a rapid introduction to all of the different ways a service mesh affects your application.
  • The service mesh maps a declarative interface to the proxy configuration. Security requires a couple of different abstractions when it comes to loading a certificate or doing API authorization. In Spring, it's really easy to build yourself a global security method.
  • The circuit breaker sets the maximum pending and current connections for the upstream services. Outlier detection does the circuit breaking and ejects the service instance once a certain number of failures reaches a threshold. There are some nuances to this, so you can implement it again as an abstraction in the service mesh.
  • There are two sources of telemetry: metrics and traces. You need both application-side and service mesh telemetry. Not all the information in the service mesh metrics and traces will help unless you have the application side set up as well.
  • Service discovery, load balancing, security, traffic management, and telemetry: you can do these within an application, but you can also abstract some of these functionalities away into a service mesh. If you have any questions about the appropriate configurations, you're more than welcome to reach out.


This transcript was autogenerated.
This is a developer's introduction to service mesh. I realized that a lot of the service mesh resources I was seeing had an operator approach to them: how do you build a service mesh? How do you configure it? But it turns out there's a lot of developer capability that you need to bring to a service mesh and use for your applications in order to get value out of it. So if you're a developer, or you're an operator who needs to enable developers on how to use a service mesh, this is a very rapid introduction and overview of all of the different ways a service mesh affects your application. My journey into service mesh started with a very vague statement: we must have a service mesh. A security engineer approached me with this concern. I was a little bit confused; I wasn't sure what a service mesh was at the time. I was doing an embed with an application development team as sort of an operations or infrastructure engineer, and it was a very interesting statement. I had never heard of it. Eventually I got to the core of what the engineer was looking for, and they mentioned to me that they wanted service-to-service communication with mTLS. Basically, they wanted each service to communicate encrypted with a certificate. And when they approached the application team with this requirement, the concern was that it would take way too long to refactor every single application and all of the code to use certificates. And that is a valid concern. Do you need service-to-service mTLS if, for the most part, you've secured all your applications inside a private network? Well, you can never be too sure. So the security team was looking for a way to secure communications with mTLS, point to point, between applications, and their research introduced the service mesh. Now, as I investigated service mesh a little bit further, it turned out there are a lot of pieces to how applications connect to each other.
It's not just about securing and encrypting the communications between services. It turns out services need to discover each other; we usually did this with DNS. So how does service one get to service two? Second, services load balance. You need to be able to load balance between instances of a service as well as between different services. Security was the concern that first came to me, but besides mTLS, there's also authorization: are services allowed to communicate with each other on an API? There were a lot of sophisticated tools out there, as well as code libraries that supported this in applications, and did we really want to change that? Then there's traffic management: some applications might be a little bit more sophisticated in how they require retry handling as well as error handling. And finally telemetry, where we were trying for distributed tracing. It was really difficult to implement, and we were trying to pretty much unify metrics across the board. So all of these functions were ways in which services communicated with each other, or ways that they needed to interact with each other. We weren't really sure what a good answer was, because all of these different concepts required multiple tools. So I did more research on the service mesh to try to understand why it solves this problem. And it comes down to this: service meshes rely on something called proxies. In this case, we're talking about Envoy proxy as a tool, but there are many other proxies, as well as service meshes with other custom proxies. In this case, we'll just focus a little bit more on Envoy. For every application instance that you have (for example, I have report v2, report v3, expense v1, and expense v2), there is a proxy running next to it. The proxy is responsible for all communications between services. Any time report v2 needs to communicate out, traffic goes through the proxy. Any time traffic comes in, it's through the proxy.
This has an interesting side effect. If you have multiple application frameworks, which is usually the case in larger companies, you have the ability to direct traffic through the proxies. And as a side effect, this means you can build abstractions, almost a layer, on top of the proxies. For example, the expense proxies represent, as a whole, the expense service, whether version one or version two. Similarly, the report service is the abstraction over report version two and version three, and the proxies can represent that. Now report communicates to expense, and again everything goes through the proxies. So the proxies control whether or not the report service can communicate to expense, the upstream service. All of this, plus some kind of control plane, equals a service mesh. When we mention a control plane, we mean that you can push configurations out to each proxy, and a service mesh pushes configuration out to each proxy. So if, for example, I wanted to create a report service, the service mesh would create the abstraction of the report service and send that configuration to the proxies. Now, regardless of which service mesh you use, for the most part they all use a very similar approach in the way they push configuration out to the proxies. So while most of the configurations you'll see today are Consul focused or Envoy proxy focused, you'll see similar functionality in other service meshes. My hope is that if you're using something else, you'll be able to understand the generic terminology and apply it to your application. So in this case we are able to create a service mesh configuration and push it out to the proxies. Now, if you're in Kubernetes, it's pretty easy to add the proxy in place: the idea is that you can use an annotation, or you can inject it by default.
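As a sketch, that annotation-based injection might look like this with Consul on Kubernetes. The Deployment, service name, and image tag below are illustrative, and this assumes the consul-k8s injector is installed in the cluster:

```yaml
# Hypothetical Deployment for the report service. The single
# annotation asks Consul's injector to add the Envoy sidecar
# proxy to every pod created from this template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report
spec:
  replicas: 1
  selector:
    matchLabels:
      app: report
  template:
    metadata:
      labels:
        app: report
      annotations:
        consul.hashicorp.com/connect-inject: "true"
    spec:
      containers:
        - name: report
          image: report:v2  # illustrative image tag
```

Other meshes follow the same pattern with their own annotation or label; the application container itself is unchanged.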
So most service meshes will allow you to add a service mesh annotation, and it will inject the proxy for you. For many of the Kubernetes service meshes, that proxy is Envoy, although some other service meshes use different proxy tools. So the idea is that if you're on Kubernetes, you can use the annotation. Consul does do service mesh outside of Kubernetes as well; in that case, you do have to add the sidecar proxy as a process. So if you're doing this on a virtual machine for a much older application, you will have to deploy the binary for the proxy and then configure it as a process on the virtual machine. So next, what does this mean? Well, this configuration, this abstraction, pushing all of these things into a service mesh means that you can configure service discovery, load balancing, security, traffic management, and telemetry in one place, irrespective of the application code and libraries. So whether you're a development team, or you're an operations team trying to enable a development team to do this, the idea is that you're moving the functionality you might have already implemented in code for service discovery, load balancing, security, traffic management, and arguably telemetry into a service mesh. So we're going to go through all five of these. In the case of service discovery, remember there are two sets of abstractions: the application-side options as well as the service mesh. Application-side options typically involve libraries like Eureka if you're in the Spring ecosystem, or DNS, or Kubernetes services if you're on Kubernetes. A service mesh does this all with proxy registration. So in the application code case, in a programming language that allows you to do this, you can do something as easy as adding an annotation to your application. This is Spring, and in this case I'm enabling the discovery client, and now I've got service discovery for Spring applications.
Problematically, not all applications are using Spring, or Java for that matter. So you may have heterogeneous workloads with different kinds of application frameworks, in which case maybe the service mesh service discovery approach is actually much more useful. The service mesh will again create the abstraction of the report service for v2 and v3, as well as the expense service for v1 and v2; it doesn't matter what application frameworks they are. When you look at a proxy admin configuration, most of them will have this clusters endpoint, and the clusters endpoint has a list of the service names, such as expense, as well as the IP addresses. So in this case I'm going to the Envoy proxy and running an API call just to do some debugging. And if you examine this debug interface, you'll see that there's an expense mapping, a Jaeger mapping, and an expense v2 mapping, each to an IP address. This is pushed out because when the proxy registers, it has information about the service, and Consul itself pushes that information further to the proxies. That is where you're getting the service discovery piece. In the case of load balancing, you also have two options here. Application side, again, you can use a library like Feign, plus load balancers and DNS; the combination usually gives you some kind of load balancing configuration. In the case of a service mesh, you're using pretty much just proxy configuration. So again, if you're lucky and you're using something like Spring, you have the enable Feign clients annotation, and that injects a client that allows you to load balance between certain service instances or application instances. In the case of multiple application frameworks, well, a service mesh again takes that abstraction and pushes it out into a separate layer. So in this case you can use a service mesh to push configuration out: 50% to version one, 50% to version two.
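The 50/50 split described above might look something like this as a Consul custom resource. This is a sketch: the service and subset names are taken from the talk, and it assumes a ServiceResolver elsewhere defines the v1 and v2 subsets of expense.

```yaml
# Hypothetical ServiceSplitter sending half of the traffic for
# the expense service to each version subset.
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: expense
spec:
  splits:
    - weight: 50          # 50% of requests
      serviceSubset: v1   # subset defined in a ServiceResolver
    - weight: 50
      serviceSubset: v2
```

Applying this resource is what causes the mesh to translate the declared weights into the proxy configuration shown next.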
What this looks like: if you go into Consul, for example, and I retrieve the service splitter configuration, it outlines an expense service splitter. And if I print out the CRD, the custom resource for it, you'll notice that 50% of the weight goes to v1 and 50% goes to v2. All of this is done through my interface of choice for my service mesh. In this case, this is a custom resource definition in Kubernetes, but you could do this with an API call to Consul, or an API call to any other service mesh. Why is this important? Well, when you examine this in your service mesh configuration, or your proxy configuration, your service mesh is mapping that declarative interface you've made on the weight to your proxy configuration. Effectively, it's doing that transformation for you. So you'll also notice the weights, 5000 and 5000, for expense v1 and expense v2, as well as the total weight. This is on the administrative side of the proxy itself; the proxy has this as JSON, and it is actually available for you to see. So basically, service meshes are pushing all this configuration out to the proxies, and now the proxies have awareness of all of the weights that you need. The benefit of this shows when you access it from the report service, so that's what I'm going to do: I'm going to access this through an API call from my report service to my upstream expense service, and I'm just going to get the expense version. You'll notice that it is load balancing between the Java version, which is the 0.0.1 snapshot, and the .NET version, which is 6.0. So all of these are configured through one interface and pushed out. Irrespective of whether it's .NET or Java or any other application framework, you have a single abstraction to do that. So, security. This is where I started my journey and where I first heard about service mesh, and there were some misnomers to it, right?
Security requires a couple of different abstractions when it comes to loading a certificate or doing API authorization. Libraries, including write-your-own libraries, often provide an easy interface to sideload a certificate or validate it if you want. On top of that, if you're doing something like API authorization (for example, report can only access expense on the version endpoint), that API authorization flow could be done separately by a special server, or it can be done with OIDC or JWT. In the case of a service mesh, it's a little bit different: you get mTLS out of the box between proxies, as well as proxy filters, and the proxy filters help you filter traffic based on API authorization endpoints. So I'll actually show this. First we'll talk about the application side, and then we'll talk about the service mesh. On the application side, the common complaint that I was getting from a number of developers for quite some time was that they would have to add their own certificate validation code into their codebase. This is taken from the ASP.NET Core documentation, but for example, in the case of .NET you'll have to add a validation event and you'll have to add your own logic for that. So it can be quite a bit of code. In the case of a service mesh, mTLS is a little bit different: mTLS happens between each of the proxies. So from proxy to proxy, across all of these services and versions, communication is mTLS, so it's all encrypted. However, it's not encrypted between the proxy and its application instance, for example report v3. Each proxy that is running sidecar with a report or expense instance is not going to have any mTLS to its local application. That's where the caveat is, but mTLS is going to be within the mesh and between the proxies. Now, if you're looking at this in the service mesh, you can actually see that it is applying a certificate to each proxy.
So if you do a config dump, which again is the administrative interface for Envoy proxy, you'll notice that there's a certificate chain as well as a private key and a validation context. All of this is done within the mesh. So you get mTLS between proxies, effectively point to point; it's unencrypted between the proxy and the application instance. The second piece of this is API authorization: whether or not report can communicate to expense, whether it can do so on certain API endpoints, or only with certain methods. Now, in Spring it's really easy to get this done, in that you have an OAuth2 client annotation as well as a global method security annotation, and then you can configure how services communicate with each other. But if you have something like .NET or Go, or something else where that doesn't really exist, it's not that easy to implement; you have to build it yourself. So in this case you can push it, once again, into the service mesh. For example, in this service mesh I'm allowing report to access API expense trip on the expense service. That API authorization means that if the traffic going through the proxy accesses an endpoint on the expense service that's not API expense trip, it will not be allowed to do so. It's a little confusing, and there's a lot of text in this, but the idea is that if you're doing a dump on the administrative interface of Envoy proxy, you'll notice that there's a filter implementation. This filter implementation adds the rules for access between services. So in this case, the principal report can access expense on the path prefix of API expense trip; however, it's not allowed to access anything else. Now, you could look at this not as part of Envoy proxy, but in a much more user-friendly way.
You can see this as part of something called intentions in Consul. Consul basically abstracts these proxy configurations and gives you a more intent-driven view of how it works. Effectively, when you create a custom resource called an intention, the intention describes that you allow report to access expense on API expense trip using a GET. You can do a GET against that API from report, but you cannot do anything else. So this intention maps down to the proxy configuration that I showed earlier. Next, traffic management. This one's a little bit more complicated; it can get very lengthy to describe, so I'm going to try to abbreviate it. In application space, especially with services, we talk a lot about circuit breaking, retry handling, and the importance of error handling, and most of these have traditionally been done by libraries. There were libraries that would allow you to circuit break based on certain configurations, or you would write your own kind of retry handling, which does happen. In the case of a service mesh, you can get similar functionality. There is a bit of a confusing terminology shift, in that if you're using something like Envoy, a circuit breaker is not quite the same as the circuit breaking pattern. The circuit breaker sets the maximum pending and current connections for the upstream services, and then outlier detection does the ejection. So technically, outlier detection does the circuit breaking and ejects the service instance once a certain number of failures reaches a threshold. But the combination of the two implements the circuit breaker pattern, so if you're familiar with that from an application view, you'll need the combination. In the case of Spring, it's really nice: you enable the circuit breaker with an annotation there, which makes it super easy. In the case of .NET, it's a little bit trickier; you have to write your own circuit breaker policy.
So in this case, the trouble is that if you want a holistic view across all of your services of how they're circuit breaking on each other and all of their behaviors, you'll have to scan through all of the code to find that information. So in this situation, you do have to consider how you inject this information into each application. And if one service isn't using .NET and is using something different, and you're doing this across multiple services, you need to keep track of what kind of circuit breaker behavior is happening. So there are some nuances to this, and you can implement it, again, as an abstraction in the service mesh: if there's a certain number of HTTP 500 errors, say greater than three, eject the service instance and then divert traffic to the other service version. This is pretty useful if, for example, you rolled out expense v2 and there are a ton of errors in it; circuit breaking will eject the service and then divert everything by default to expense v1. Circuit breaking does require a little bit more time to show, and I'm not going to show it today for the sake of time, but if you're interested, there are a couple of interesting videos that show how circuit breaking in a service mesh works in greater depth. Now, to configure this on the Consul side (I won't show this in the Envoy config because it's a rather large config), if you're configuring this from your service mesh and you're pushing it into your Envoy configuration, you would configure something in Consul called a passive health check. Finally, this is probably the one that is my favorite, but also the one I most commonly get asked questions about. Telemetry is a little bit tricky. There are two sources of telemetry, and that's for metrics and for traces. So when I say telemetry, it's for metrics and traces both. But there are actually two sources of telemetry you need. There are application-side sources.
So these are the libraries for OpenTelemetry, the Prometheus exporters, or your own write-your-own application-side options. Then there's the service mesh telemetry: the service mesh telemetry has proxy metrics and proxy traces. One of the things you have to understand with telemetry is that you must have both application and service mesh telemetry. Just because you have a service mesh doesn't mean you get telemetry out of the box. Not all the information in the service mesh metrics and traces will help unless you have the application side set up as well. So one thing to consider is that you need instrumentation for your application; you cannot omit this. Your application needs instrumentation specifically for tracing, because it needs to propagate the traces. So if you do not have metrics or tracing in your application, adding the service mesh doesn't necessarily give you that out of the box. You still need it. If you're looking at something like .NET, I'm using OpenTelemetry: I just add OpenTelemetry metrics as well as OpenTelemetry tracing, and easily enough, it creates the metrics as well as the traces that I need. In this case I'm using Prometheus, and I'm exporting Zipkin spans. In the case of OpenTelemetry for Java, there's an agent, so you don't actually need to add anything to your application code; instead you load this library and then you add some configurations. Again, I'm using Zipkin and Prometheus. You have to keep them consistent; if they're not consistent, then traces in particular will not go through correctly. The service mesh configuration for tracing is a little bit different. You first have to configure your service mesh to expose the proxy traces: the proxies themselves carry trace information, and you want to expose it. The way you do that is, whether you're in Envoy directly or in a service mesh, the service mesh will push a tracer config into Envoy.
And if you check the proxies, the proxies will have the Envoy trace config for, let's say, Zipkin, and then you can assign the collector cluster (in this case I'm using Jaeger) as well as the endpoint. One thing I found very difficult in this situation is that you have to make sure that whatever instrumentation library you're using, its export format for traces must match the tracer that you're using in Envoy or your service mesh. For a very long time, older Envoy versions pretty much supported the Zipkin format, and that was pretty much all they would use. Now there are many more tracer options for you, so just make sure it's consistent. In this case, I standardized on Zipkin, because previous libraries did not support, let's say, OpenTracing or other formats; they were just using Zipkin. So as the lowest common denominator for all my applications, I chose Zipkin spans, and in this case I would use the Zipkin format specifically for Envoy. In the case of metrics, you need to expose service mesh and proxy metrics. Those actually do come out of the box, as long as you enable them. In the case of Consul, for example, I'm just setting the Envoy Prometheus bind address on port 20200, and that pretty much enables the proxy metrics in Prometheus format. Now the trick, however, is that if you really want to get the benefit of metrics, you have to merge the metrics that you instrumented in your application with the proxy metrics endpoint. Most service meshes allow you to merge the application metrics with the proxy metrics, and this is something you will need to add, or I highly recommend you add. In the case of Consul, you can add an annotation that says enable metrics merging equals true, and then you tell it which port the service metrics are available on. The metrics port is on the application; in this case I have 9464. It was really convenient.
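The metrics merging setup described above might be sketched as pod annotations like these. This is a sketch assuming consul-k8s; the port 9464 is the application metrics port mentioned in the talk.

```yaml
# Hypothetical pod annotations: inject the sidecar, merge the
# application's Prometheus metrics into the proxy's metrics
# endpoint, and tell Consul which port the app serves them on.
metadata:
  annotations:
    consul.hashicorp.com/connect-inject: "true"
    consul.hashicorp.com/enable-metrics-merging: "true"
    consul.hashicorp.com/service-metrics-port: "9464"
```

With merging enabled, Prometheus only needs to scrape the proxy's endpoint rather than each application directly.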
The result is that when you get the metrics endpoint from the proxy (not from the application, from the proxy), you'll notice that it merges the Envoy metrics as well as, let's say, the runtime JVM metrics. This is in the case of Java, but the idea is that you want to expose the application metrics for Prometheus to use, merge them into the Envoy proxy endpoint, and that way Prometheus can scrape them in one place. You also protect your application that way, right? In the case of the mesh, what you're trying to do is keep your application from being publicly available. So what you're doing is scraping the Envoy proxy endpoint, which merges the metrics. That's the trick: for those who are trying to do this and have invested in instrumenting your application, you want to make sure this is done. If you do all of this right, in the case of a service mesh, what you end up seeing is a somewhat different kind of trace. It's not vastly different, but you do get a little bit more information. So here I've been issuing traces across different commands. Previously, you'll notice that I did some traces; I'm using Kong as an API gateway, and Kong itself is also in the service mesh, actually. So you'll notice there's proxy information here about where it is, its peer, et cetera. And then you'll notice there's actually a component labeled proxy: this is the Envoy trace here. The Envoy trace includes the internal span format and lets me know exactly where it's going. It's report, it's report v3, so this is where I know it's going to version three. You'll notice that these are my application traces, so this is from OpenTelemetry. I added OpenTelemetry in here, and it's tracking the calls to the controller as well. So you'll notice the OTel library's name, as well as the GET and the subsequent nested child spans here as well.
This is calling expense, so you'll notice that it's furthermore calling expense. And then you'll notice there's the demo expenses span; this is calling the database. So the full trace here is available. But the only reason it works is that I have turned on tracing implementations in every part of the expected trace: from the proxies to the gateways to the internal instrumentation within the applications, I need to make sure all of them propagate the trace. All right, so we talked about these five different concepts: service discovery, load balancing, security, traffic management, and telemetry. All of these are very central to how services communicate with each other. You can do this within an application, but you can also abstract some of these functionalities away into a service mesh. This isn't a statement on whether or not you should use a service mesh. The point is that most applications will end up using a little bit of both internal configuration and a service mesh. The idea is that if you have a lot of different services and you plan on growing, and you don't want to configure all of these different code bases, then maybe consider doing a service mesh as an abstraction. But if you're a developer and you're being asked to implement it, hopefully this provides a reasonable mapping of how you would do this in an application, and then how it impacts and changes as part of a service mesh. Now, if you want a very thorough example like the live one I showed today, feel free to go to this URL. It has all of the in-depth configuration as well as the entire environment that you would need to set up. Hopefully it provides a deeper reference. If you have any questions about what the appropriate configurations are, you're more than welcome to reach out to me. I appreciate you tuning in to Conf42.

Rosemary Wang

Developer Advocate @ HashiCorp

