SRE Best Practices for API Design

Abstract

In modern development teams, site reliability engineers (SREs) are the glue that holds developer and operations teams together.

It is the goal of the SREs to increase the reliability of their services to meet production standards by setting up monitoring, ensuring proper resource allocation, rolling out updates gradually, and anticipating the cost of failures.

As APIs grow in size, the need to make them reliable and robust also grows.

In this talk, Navendu will talk about best practices for API design that lean towards reliability. Attendees will learn about:

Reliability issues in traditional API design.
How SREs fit in the API development pipeline.
Modern API development best practices using API gateways.
How SREs can combine DevOps practices to build more reliable services.

Summary

Navendu is a developer advocate at API7.ai. He currently contributes to Apache APISIX, which is a cloud native API gateway. The talk looks at what reliability means for APIs and at reliability issues in traditional API design. If you have any questions or would like to discuss things further, reach out to Navendu on Twitter.
Reliability is more than just uptime. Latency is also an important factor when it comes to reliability, and so is security. In a traditional API architecture, improving the reliability of your services means configuring or adding something new to each service and each endpoint. What is the solution? API gateways.
There are a lot of vendor-neutral and open source API gateways available. Authentication and security, as discussed earlier in the talk, are essential. Rate limiting is also important, mainly because it prevents intentional or unintentional misuse of your APIs. It also helps with scalability when your API encounters traffic spikes.
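As a rough illustration of the rate limiting described here, the following Python sketch uses a token bucket. This is one common way to limit request rates and not necessarily what any particular gateway does. The names `TokenBucket` and `handle` and the choice of a 503 status are assumptions made for this example; the talk only mentions a 500-range status code, and many APIs return 429 Too Many Requests instead.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Hypothetical limiter: refills `rate` tokens per second, stores at most `burst`."""
    rate: float
    burst: float
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = self.burst
        self.updated = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) may pass."""
        # Refill for the time elapsed since the last call, capped at `burst`.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(bucket: TokenBucket, now: float) -> int:
    """Return the status a gateway-like front end might send for one request."""
    # 503 is an assumption standing in for the "500 range" code from the talk.
    return 200 if bucket.allow(now) else 503
```

In this sketch rejected requests are dropped immediately. The delaying strategy from the talk would instead queue the request until `allow` succeeds. Passing `now` explicitly keeps the example deterministic; a real caller would likely pass `time.monotonic()`.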
Monitoring and observability deal with tracing, logging, and metrics. Setting up monitoring on the gateway lets you track your reliability metrics across all of your traffic, and it tells you when your API has failed. Circuit breaking mechanisms are discussed later in the talk.
With a canary release, an API gateway initially directs all of your traffic to the upstream running the current version, then gradually shifts traffic to the upstream running the new version. This helps your services stay up throughout the release. In case something fails, or there are some issues, you still have your previous upstream on standby. Circuit breaking is also essential in modern software architectures.
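The gradual traffic shift of a canary release can be sketched as weighted routing between two upstreams. The Python example below is only an illustration under assumed names: the upstream labels `v1` and `v2`, the percentage weights, and the functions `pick_upstream`, `shift_traffic`, and `rollback` are all hypothetical and do not correspond to any specific gateway's configuration.

```python
import random


def pick_upstream(weights: dict, rng: random.Random) -> str:
    """Pick an upstream name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def shift_traffic(weights: dict, step: int, canary: str = "v2", stable: str = "v1") -> dict:
    """Move up to `step` percentage points of traffic from stable to canary."""
    moved = min(step, weights[stable])  # never push the stable weight below zero
    return {**weights, stable: weights[stable] - moved, canary: weights[canary] + moved}


def rollback(weights: dict, canary: str = "v2", stable: str = "v1") -> dict:
    """Send all traffic back to the stable upstream, which stayed on standby."""
    return {**weights, stable: weights[stable] + weights[canary], canary: 0}


# Start with everything on the current version, then shift 10 points at a time
# while the new version looks healthy.
weights = {"v1": 100, "v2": 0}
weights = shift_traffic(weights, 10)
```

The decision to call `shift_traffic` again or `rollback` would come from the monitoring data for the new version, which is left out of this sketch.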
What happens when you change the path of your API? How does it affect users, and how can you ensure reliability in such a case? How do you let your clients know about the new API endpoint? An API gateway can be configured to redirect requests from the old path to the new one.
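The talk's answer to these questions is to have the gateway redirect the old path to the new one, optionally with a redirect status code and a deprecation message. A minimal Python sketch of that idea follows. The paths in `REDIRECTS`, the `Response` type, the `route` function, and the choice of status 308 (Permanent Redirect) are assumptions for illustration, not details given in the talk.

```python
from typing import NamedTuple


class Response(NamedTuple):
    status: int
    headers: dict
    body: str


# Hypothetical redirect table: old path -> new path.
REDIRECTS = {"/v1/users": "/v2/users"}


def route(path: str) -> Response:
    """Redirect deprecated paths; otherwise pretend to forward to the upstream."""
    new_path = REDIRECTS.get(path)
    if new_path is not None:
        # 308 tells the client to repeat the same request at the new location.
        message = f"{path} is deprecated; use {new_path}"
        return Response(308, {"Location": new_path}, message)
    return Response(200, {}, f"served {path}")
```

Clients that follow redirects keep working without code changes, while the message in the body tells their maintainers which path to move to.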
To recap, reliability is more than just uptime. It is also about consistency, availability, low latency, security, and status. The talk also covers API gateways and how they overcome the issues faced by traditional API architectures. If you'd like to learn more, check out the Apache APISIX documentation.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi, and welcome to this talk on SRE best practices for API design. I'm Navendu, and in this session we will look into how development teams can build reliable APIs. We will look into what reliability means for APIs and at reliability issues in traditional API design. We will also look into how SREs fit into API development pipelines, and we will top it off with SRE- and DevOps-centric best practices for API development with an API gateway.
Before we move on, a little bit about me. I am Navendu, and I'm a developer advocate at API7.ai. I currently contribute to Apache APISIX, which is a cloud native API gateway. I was also a Cloud Native Computing Foundation open source maintainer, and I help Google Summer of Code and LFX mentees start their open source contribution journey. You can reach out to me on Twitter; I'm mostly active there. If you have any questions, or if you would like to discuss things further, feel free to reach out to me there.
All right, let's start the session by discussing reliability. What does it mean to be reliable? If you are a seller of an API, you might have SLAs; you might quote to your customers that your API has 99.9% uptime. But uptime can be a myopic view of what reliability entails. Even uptime is achieved by making sure that your services don't crash, so uptime is the result of other factors.
So what does it mean to be reliable? When talking about reliability, a lot of terms get tossed around. The first is consistency. Especially in the case of APIs, you need to keep your APIs consistent so that client applications can get reproducible results from them. You also need to make your API available. Availability directly translates to what could be the uptime.
We want to make sure that your API is available all the time, or as expected, so that the consumers of the API don't have app crashes due to a lack of response from it. Then there is low latency. A service with high latency is almost equal to a service that is not working; for a client or a consumer, it basically translates to a failed application. So latency is also an important factor when it comes to reliability. Then there is security. Secure APIs and secure services are the pillars of reliability when it comes to APIs. On top of that, you also need to make sure the status of your API is visible. This goes for both the development teams and the consumers of the API: both should be aware of how the API is performing and what its status is. Is it up right now, or are there some redirects configured, that sort of thing. We will look into this further. I want to emphasize the point that reliability is more than just uptime.
For this talk I will use the term microservice loosely. It need not be a cloud native microservice; it can be your application servers or anything that is serving your API to your consumers. Traditionally, you will have more than one client for your API, and I am representing that here with different applications running on different platforms.
Let's look into some of the problems with traditional API architectures that will concern you as a site reliability engineer. We talked about the different pillars, or aspects, of reliability. If you want to improve the reliability of your services, you have to do something about it. In a traditional API architecture, that means you end up having to configure or add something new to each of your services and each of your endpoints. These endpoints could be written in multiple programming languages.
They could be using different libraries, all sorts of things. So it is not plug and play. It is a tedious job that can waste a lot of developer hours, and it is not centralized. When you see something like this, it immediately pops into your mind that this should have been centralized. As the API scales, you also have to scale the structure you have set up to ensure reliability, which is not feasible or sustainable.
If you are setting up monitoring, you end up having to monitor every service, or maybe every request to the service, or every endpoint in the service. The same goes for security: you have to configure it for each of your services. Authentication is also not centralized, so you end up configuring it directly on all your services, which, needless to say, is a lot of burden for the developers and for the maintenance team that works on it afterwards.
We can imagine how difficult it would be to make new releases. It is a tiring job because you have to ensure very little or zero downtime, and you want to ensure that no requests are interrupted while transitioning to the new version of the API. From a traditional perspective, this seems too difficult to handle.
What is the solution? What can we do to overcome this? That is where we introduce API gateways. API gateways have been around for a really long time, ever since the API development model was popularized, and they have been gaining wide adoption ever since people started moving from monoliths to microservice-based architectures. So what do we mean by an API gateway, and why should you care about it?
If we go back to our services, you have a lot of them, and you end up applying all of your configuration, such as monitoring, tracing, security, authentication, and traffic control, directly to each microservice. That is where an API gateway steps in. An API gateway acts as a common entry point for all of your traffic. It holds some configuration, and based on that configuration it routes the traffic to your backend services. In essence, an API gateway abstracts out all the configuration you need on your APIs. In terms of observability, for example, it moves the burden from each of the individual services into one instance that can be managed centrally.
An API gateway performs a lot of functions. It manages authentication, it deals with your security, it can be configured to allow for monitoring and observability, and it can be used for traffic control, among a lot of other things. So an API gateway is quite useful.
With that in mind, let's look at some of the reliability best practices for API gateways. There are a lot of vendor-neutral and open source API gateways available. As I mentioned, I am one of the maintainers of the Apache APISIX project, which is a cloud native API gateway. But throughout this talk I'll be talking about API gateways at a high level, and you can use any API gateway of your choice, or even go for a cloud provider's API gateway.
So let's look at reliability best practices with these API gateways. Authentication and security, as we discussed in the earlier part of this talk, are quite essential. The first thing is user authentication. Authenticated requests are a proven way to secure your client-API interactions.
When it comes to monitoring, authenticated requests also help you monitor your APIs in a very fine-grained manner. The picture is self-explanatory: all traffic is routed through the API gateway, and the API gateway handles the authentication. You can have basic authentication, like a JWT or a cookie in the header, or you can use authentication providers like Active Directory, or maybe even other forms of authentication. Basically, the API gateway takes care of all of your authentication needs. Once your client is authenticated, the information gained from the authentication can be used by the logic in your services, or later on in your backend.
The next important aspect of security is rate limiting. This is something that some of you might not have thought of from a reliability perspective. But rate limiting is quite important, mainly because it prevents intentional or even unintentional misuse of your APIs, such as denial-of-service attacks. It also helps with scalability when your API encounters traffic spikes, especially unexpected ones.
All of your requests are routed through your API gateway. If your services can't handle a set of requests, you can block them. If there are too many requests, they won't be processed, and you can either reject those requests or delay them, depending on your configuration and what you are trying to do. By reject, I mean you entirely drop those requests and probably return a 500-range status code to your client.
Or you can delay those requests. If you or your client application can tolerate some level of latency, you can delay the requests until your services are able to handle them. It works on a first come, first served basis. There are other ways in which you can ensure security and authentication, but I will leave that to you to explore and move on to the monitoring and observability part of our discussion.
Monitoring and observability deal with tracing, logging, and metrics. We have our API gateway, and with monitoring you can track your reliability metrics. We talked about some metrics, and setting up a monitoring tool directly on your API gateway means you can monitor all of your traffic for those reliability metrics. The API logs and your traces give detailed information about one particular request. A trace tracks the entire request throughout your API, from your API gateway, through your services, and back to the client. Both can give detailed information about the different reliability metrics, so you can know how your API is performing.
Setting up monitoring also helps you know when your API has failed or when there is an error, instead of it failing silently. With monitoring and alerts set up, you can come in, fix the system quickly, and get it up and running again. Later on we will discuss some circuit breaking mechanisms, but setting up monitoring can go a long way. It can also help you understand your traffic: when to scale and when not to scale. Those kinds of metrics are key here as well. So, to go back: we can set up tracing, we can set up logging, and we can set up metrics.
Now let's look at version control and zero downtime. Maybe this is more straightforward to think about when it comes to reliability, especially in the case of zero downtime.
How do you ensure that your services stay up all the time? Let's first look at the version control aspect. When you are releasing a new version of your API, how do you do that? There is a release strategy called canary release. With an API gateway, you direct all of your traffic to an upstream. Upstream here represents your backend or your services. You have an upstream on version one and you are trying to introduce a new version two, but you haven't tested it with production traffic before, so you want to ensure that it works perfectly before you deploy it completely. You don't want to have to roll back to the previous version when something fails; you have to ensure that it will work.
Initially, we get all traffic to our API gateway, and it directs all traffic to the initial version of our upstream. That is how it functions normally. When we have a new version ready, we direct a small part of the traffic to the new upstream. An API gateway can be configured to do this dynamically. Based on the results from this traffic to the new version, if it is working fine, you can slowly increase the traffic to the new version until all the traffic is directed to it. This ensures that your services stay up all the time, and it also ensures that the new version you have released works perfectly. In case something fails, or there are some issues, you still have your previous upstream on standby, and you can go back to it quite easily by just changing the configuration in your API gateway.
Now let's talk about circuit breaking. Circuit breaking sounds like an electrical engineering concept, but it is quite essential in modern software architectures. You have multiple upstreams, and all of these upstreams do the same thing.
Our API gateway acts as a load balancer for your upstreams. If one upstream service is unavailable, or maybe it is experiencing high latency, it needs to be cut off. If you don't cut it off, requests coming to the failed upstream will stall and cause resource exhaustion, and the gateway or the service will keep retrying them. This can cause a chain reaction, and it can even cascade into all of your other upstreams, so your whole system may fall like domino tiles. That is why it needs to be cut off.
Say your upstream has gone down. The circuit breaking functionality of an API gateway cuts off all traffic to your failed upstream and instead routes all traffic to your fully functioning upstream. Once the upstream is back, or once some time has passed, the API gateway checks the status of the upstream. If it is working fine, it goes back to the healthy state, traffic is sent to this upstream again, and it functions as normal.
Finally, there is also the case of reporting status, creating new APIs, or changing APIs. What happens when you change the path of your API? How does it affect users, and how can you ensure reliability in such a case? A client is used to sending requests to one particular path, and that path is no longer there because you changed it for whatever reason; maybe the service changed. Your old path is gone and you have a new path instead, but the client doesn't know the new path. You either have to talk to them beforehand or provide documentation on the change, or something like that. In most cases, changing the client code can be a tedious process. So how do you handle such cases?
How do you let your client know that this is the new API endpoint? That is where an API gateway comes in. In your normal use case, requests go to the old API path and the API gateway directs all traffic to your API endpoint. When that path is no longer there, you can change the configuration of your API gateway to redirect traffic from the old path to your new path. Every time a user goes to the old endpoint, the API gateway redirects the user to the new API. You can even return a redirect status code before redirecting, and you can send a message saying that the old API path is being deprecated, this is the new path, and please change to it. It will still be backwards compatible, as users of the old API will still be able to access the new endpoint without having to change any of their client code.
Let's wind up this session with a quick summary of the key takeaways. We started this discussion by talking about reliability, and we decided that reliability is more than just uptime; it is also about consistency, availability, low latency, security, and status. We looked into API gateways and how they overcome the issues faced by traditional API architectures. We also looked at how API gateways can help with best practices for reliability in the areas of authentication and security, monitoring and observability, and version control and zero downtime.
That's it. If you'd like to learn more, you can check out the documentation for Apache APISIX, which is a free and open source API gateway hosted by the Apache Software Foundation. There are also other free and open source API gateways out there. You can also reach out to me on Twitter if you have any questions; here is my Twitter handle. I'll also be hanging out in the Discord channel, where you can ask questions. So thank you.

Conf42 Site Reliability Engineering 2022 - Online

June 09 2022

Navendu Pottekkat

Developer Advocate @ API7.ai
