Conf42 Cloud Native 2022 - Online

Distributed application level RBAC with OPA

Abstract

Sooner or later, every business needs to design its data and API authorization model with granular control over what its users can do. How do you do it in the world of distributed systems without disrupting the codebase or introducing a single point of failure?

In this talk we will present our journey of discovery toward an efficient distributed solution leveraging Open Policy Agent, Go, and other technologies, with everything running in the Kubernetes ecosystem.

Summary

  • Distributed role-based access control with Open Policy Agent. Federico Maggi, Senior Technical Leader at Mia-Platform. We had to change our authorization model from simple session validation to role-based access control. We will dive into the technical problems we faced using a stack based on CNCF technologies.
  • Mia-Platform is introducing RBAC into its Kubernetes-based platform. The company wanted to be able to filter data before it is retrieved from the database. How do we make it resilient enough for very high request volumes?
  • In order to integrate with OPA and Rego policies, our service has to gather some information, bundle it together, and eventually run the policy evaluation. Having all the policies centralized in a single place allows us to define Rego functions to abstract complex logic.
  • So really, the sky is the limit. Okay, now the talk is over. I leave you here some links: the first one is OPA, the second one is a blog post I wrote for Mia-Platform. I would really appreciate it if you left some feedback, so please scan the QR code and submit the form.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome to my talk, Distributed role-based access control with Open Policy Agent. I'm Federico Maggi, Senior Technical Leader at Mia-Platform, and what I will present is the work of recent months, when my team and I had to face a new challenge: we had to change our authorization model from simple session validation to role-based access control. In this talk I will present our journey of discovery in implementing an RBAC solution that worked for our use case but is built to be easy and extensible enough to be generic and work well in any context and platform. We will dive into the technical problems we faced using a stack based on CNCF technologies and how these problems affected our design decisions. We will work a bit with Kubernetes, Open Policy Agent, Go, and a few more. Of course, in this talk we may refer to Mia-Platform, as it was the primary subject of our RBAC, but the same approach and considerations can be applied to any other context.

Before going any further, let's take a moment to review the basics of RBAC, an authorization model meant to group the actions users can take in your platform into roles representing specific job functions. These roles are then assigned to users. This model simplifies access control governance and makes your platform more secure, allowing for easier auditing and easier permission updates, since you only need to act on a specific role rather than on each individual user.

Okay, that's enough introduction. With the basics of RBAC set, let's start our journey. Even though using RBAC seems pretty reasonable (more security, easy auditing and updates are cool), RBAC is not something you see every day, as it does not come for free. You either buy a ready-made solution or invest a lot of time and resources to build your own. So why did we need to introduce RBAC?
Well, Mia-Platform users using our product are capable of performing several different actions impacting their software applications running on Kubernetes. To name a few: users can configure a microservice, define a new data model, and deploy, monitor, and scale their services on a runtime environment. As you can understand, it is not safe to let any user perform every available action. For this reason we identified different personas. To name a few examples: we have owners, who can do any action they want (they are the owners); we have developers, who may be able to perform limited actions on their configuration or deploy on certain runtime environments; and we have guests, who have read-only access to a specific subset of resources.

Based on the personas we identified, we then mapped each of our APIs that required a different access level, and each of these APIs is mapped to a specific permission. We defined our own naming conventions for permissions and eventually grouped our list of capabilities into roles.

After defining our roles, we decided to take a step further, as we noticed that API access control has basically two different behaviors. In fact, we had a set of APIs that had to be completely blocked from unrestricted, undesired access, while other APIs were open to many different users, but the data these users may see has to be different, either because we have to filter some list of documents or because some fields in the response body have visibility restrictions. For these reasons, we defined three requirements our RBAC solution had to meet in order to be usable for us. We wanted to decide whether a request is allowed or not, so grant or deny access to it. And we wanted to filter data in two different ways.
We wanted to be able to filter data before it is retrieved from the database, so that our services only operate on previously authorized data, and we wanted to modify the response body in order to take much more granular control over field visibility for the users.

Okay, so we identified our users, our roles, and our permissions. We have the requirements set. We start coding, right? No, unluckily, no. Opening our editor and starting to implement our solution has to wait. In fact, we have to address two important concerns. First of all, where are we going to write our code, and how are RBAC decisions written? And the second concern: we know for sure that every single request in our platform will run through RBAC, so how do we make it resilient enough for very high request volumes? Let's try to answer these questions.

So where is the code? Well, the first thing that came to our mind was the easiest one: let's embed RBAC code into each service. That's a hard solution, as we knew for sure that we would incur a lot of code duplication. So yeah, we can write some software library to abstract the complexity, right? Of course we can write libraries, but in an application composed of different microservices written in different languages, we would have to write a lot of libraries. Also, we did not want to create a potential barrier to adopting new technologies, which would need the RBAC library to be written first before they were usable. And finally, coding RBAC into each service would be extremely disruptive for our codebase. We would have to change the code of many services, with the risk of introducing bugs and regressions in the existing codebase, and that risk is too big to be taken lightly. For these reasons, we decided that there should be a new component in our architecture that holds all the RBAC code.
Now, the second question gets a bit trickier, because we decided to introduce a new component in our Kubernetes architecture, and we know for sure that this component will be contacted for all API requests. So how do we make it resilient? Okay, we could deploy a new centralized service and horizontally scale it to sustain high volumes. However, we would run into two problems. The first is that we would be introducing a single point of failure: no matter the scaling, if it goes down, everything is down. The second problem is how it is invoked. Should every service invoke it? Again, that's disruptive for the codebase. And so we took our second decision: our RBAC code should somehow intercept requests and be distributed among the services that need it.

Okay, now that we have addressed our concerns, we can proceed with the design. Since we are running our application as pods in a Kubernetes cluster, we decided to adopt the sidecar container pattern. And so we deploy our RBAC sidecar with all the services holding ownership of a specific resource. To give a clear example, the service that provides APIs for managing the configuration of a project is in charge of making RBAC decisions for that resource. So it is that service that will block any attempt to change the configuration of a project by a user who does not have enough permissions to do so. Also, in order to operate, the sidecar container intercepts all incoming requests and performs all the necessary authorization checks, and if everything is fine, the request is proxied to the application container. Otherwise the request is immediately rejected.

Okay, but how is the sidecar built? Creating a new service allows us to be language agnostic, so we can adopt any language we prefer. We decided to use Go, as it helps us keep a low resource consumption profile while sustaining a high request throughput.
Please note that these two are not the only factors that contributed to our language choice, but for now let's accept them as they are. And so the design is now clear. Without our RBAC, any request the user makes to an API is received by the application container and the response is immediately returned to the user. However, as soon as we introduce our RBAC sidecar container, we are able to intercept each request and decide whether it should be rejected or allowed.

Okay, at this point we were happy with the design, but we had to ask ourselves: does this design still meet our original requirements? Remember that we wanted to be able to assign job function roles, such as guest, developer, or maintainer, to users in order to decide whether or not to give access to a specific API, to filter data before it is retrieved from the database, and to manipulate the response body in order to restrict data access even more. Luckily, the sidecar design doesn't pose any obstacle to these requirements. However, the sidecar itself must be implemented from scratch. All these requirements must be mapped into code and some configuration that lets the sidecar know how to behave for each API. We would have to write a service that receives some big configuration that maps every API to an action to take, which properties to filter, and so on and so forth. That seemed a bit complicated, and the team was worried and asked: do we really have to build this?

Luckily, the answer was no. In fact, we decided to adopt Open Policy Agent, which is an open source, general-purpose policy engine that can be used to perform pretty much any kind of query thanks to Rego, a declarative language designed for policy making. Rego is a full-fledged programming language, and so it allows us to write security policies as code, test them, and deploy to production with them.
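
The intercept, authorize, then proxy-or-reject flow of the sidecar described above can be sketched in Go with only the standard library. This is a minimal illustration under stated assumptions, not the actual Mia-Platform sidecar: the `authorizeFunc` hook stands in for the policy evaluation, and the `X-User-Role` header and the `/projects/p1` path are hypothetical names chosen for the demo.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// authorizeFunc is a hypothetical hook standing in for the policy evaluation.
type authorizeFunc func(r *http.Request) bool

// newSidecar returns a handler that checks every incoming request and,
// only if it is authorized, proxies it to the application container.
func newSidecar(upstream *url.URL, authorize authorizeFunc) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !authorize(r) {
			// Rejected requests never reach the application container.
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})
}

// demo sends one authorized and one unauthorized request through the
// sidecar and returns the two HTTP status codes.
func demo() (allowed, rejected int) {
	// A throwaway server playing the role of the application container.
	app := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello from the application container")
	}))
	defer app.Close()

	upstream, err := url.Parse(app.URL)
	if err != nil {
		panic(err)
	}

	// Assumption for the demo: the role arrives in an X-User-Role header.
	sidecar := newSidecar(upstream, func(r *http.Request) bool {
		return r.Header.Get("X-User-Role") == "owner"
	})

	send := func(role string) int {
		req := httptest.NewRequest(http.MethodGet, "/projects/p1", nil)
		req.Header.Set("X-User-Role", role)
		rec := httptest.NewRecorder()
		sidecar.ServeHTTP(rec, req)
		return rec.Code
	}
	return send("owner"), send("guest")
}

func main() {
	allowed, rejected := demo()
	fmt.Println("owner status:", allowed, "guest status:", rejected)
}
```

Because the check lives in a separate handler in front of the proxy, the application container needs no code changes, which is the property the sidecar design is after.
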
OPA provides SDKs in several languages to integrate it directly into your application, and it is here that its Go SDK shines, allowing us to take full control of the engine. With the SDK, we are able to do much more than simply run policies. We can also isolate the engine, precompile policies to improve evaluation performance, prepare data stores, and create dynamic input information to be supplied to the running policies, allowing us to make decisions with everything we need to know about the API we are protecting.

So how is it actually made? As anticipated, in order to integrate with OPA and Rego policies, our service has to gather some information, bundle it together, and eventually run the policy evaluation. To do so, it collects data from the request or the response, depending on the flow we are protecting: we take the headers, the complete URL and its parameters, and the request body. Then we fetch data from our user role binding database to understand what the user can do based on the permission mapping. And finally, after some data preparation, OPA policies are loaded from a configuration and run using all the previously bundled data.

Okay, let's see some policy examples that can be used to satisfy our constraints and see if they're good enough for us to use. The allow policy that we see here is an example of a grant or deny access policy. This policy takes the project ID provided in the input request path parameters, defines an iterator, and then loops through each of the resource bindings defined in our role binding database. If there is a binding that is mapped to this specific project ID and holds the "project view" permission, the policy evaluates to true. If any of these assertions is false, the policy is rejected and the request is not proxied to the application container.
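
The slides with the Rego policy are not part of this transcript, so the following is only a plain Go sketch of the decision logic just described, not the policy shown in the talk and not a use of the OPA SDK. The `Input` and `Binding` types, the `projectId` parameter name, and the `project.view` permission identifier are all assumptions made for illustration.

```go
package main

import "fmt"

// Binding is a hypothetical shape for one entry fetched from the
// user role binding database.
type Binding struct {
	ResourceID  string
	Permissions []string
}

// Input is a hypothetical bundle of the data gathered before evaluation:
// request details plus the bindings of the calling user.
type Input struct {
	Method     string
	Path       string
	PathParams map[string]string
	Headers    map[string]string
	Bindings   []Binding
}

// allowProjectView mirrors the allow policy described in the talk: access is
// granted only if some binding targets the requested project and holds the
// (assumed) "project.view" permission.
func allowProjectView(in Input) bool {
	projectID, ok := in.PathParams["projectId"]
	if !ok {
		return false
	}
	for _, b := range in.Bindings {
		if b.ResourceID != projectID {
			continue
		}
		for _, p := range b.Permissions {
			if p == "project.view" {
				return true
			}
		}
	}
	return false
}

func main() {
	in := Input{
		Method:     "GET",
		Path:       "/projects/p1",
		PathParams: map[string]string{"projectId": "p1"},
		Bindings: []Binding{
			{ResourceID: "p1", Permissions: []string{"project.view"}},
		},
	}
	fmt.Println("access to p1:", allowProjectView(in))

	// The same user has no binding for project p2, so access is denied.
	in.PathParams["projectId"] = "p2"
	fmt.Println("access to p2:", allowProjectView(in))
}
```

Expressing the same rule in Rego instead of Go is what lets the sidecar stay generic: the decision logic is loaded from configuration rather than compiled into the service.
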
In this other example, we see a query generation policy, which is a bit trickier, as it uses the OPA concept of unknown data so that the policy can return a set of variable definitions in the form of the assertions you see here. These definitions are later used by the service to generate a query that is provided to the underlying API so that it can filter data. In this last example, we are running the policy in the response flow, and we are able to provide Rego with the original response body received from the application service. The policy can then manipulate it. In this example, we drop the sales forecast property from each document in the list using a simple list comprehension.

Okay, we are at the end of this talk, but before saying goodbye, let's share some final thoughts on this design. With this solution, we were able to create a platform that complies with many best practices. In fact, by writing security policies as code, we are able to test them and make sure we don't introduce any regressions in our authorization model. Also, having all the policies centralized in a single place allows us to define Rego functions to abstract complex logic and always have an overview of what our policies do. Also, from a scaling and high availability perspective, even though we introduced a new hop in the request call chain, we measured a very low added latency, and having RBAC run as a sidecar gives us the possibility to scale and boost resources on the most requested services while keeping others on tighter resources. Also, thanks to OPA and Rego, we can express any kind of security policy we want. So really, the sky is the limit.

Okay, now the talk is over, and I leave you here some links. The first one is OPA. It is an amazing tool, as you may have seen, and I strongly recommend checking it out if you don't know it.
The second one is a blog post I wrote for Mia-Platform that dives into more detail about our use case, so if you're interested, you can check it out too. Now, whether you liked this talk or not, I would really appreciate it if you left some feedback, so please scan the QR code and submit the form. Thank you for your time. I really hope that you found this talk interesting. If you wish to follow me, I don't use social networks much, but you can find me on GitHub and LinkedIn. I am even on Discord, so if you have any questions, you can find me there. Thanks a lot.
...

Federico Maggi

Senior Technical Leader @ Mia-Platform
