Conf42 Platform Engineering 2023 - Online

Building a Platform with self-serve powers using ChatOps and Github Bots


Abstract

We will go through how we built Mattermost Cloud with the mindset Platform as a Product and utilise common services for the IDP. In parallel we will present how we empowered the Platform with self-serve capabilities for all - not only devs - using ChatOps and Github bots.

Summary

  • Today we will go through how we built the platform as a product to serve internal and external customers with high SLA expectations, and how ChatOps and GitHub bots boost developer productivity and self-service for both engineering and non-engineering teams at Mattermost.
  • The developer experience should happen only with the tools we use daily. We decided to use GitHub bots for GitHub and the ChatOps model for Mattermost. GitHub bots, or apps, are used to automate and improve workflows. The SaaS platform should be abstracted away from its users.
  • ChatOps is a collaboration model that connects people, tools, process and automation into a transparent workflow. In Mattermost, users can interact with bots and send commands to initiate workflows in the platform tools. All of them interact with the IDP seamlessly, without knowing exactly what's happening underneath.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone and welcome. I'm really excited to be here. Today we will go through how we built the platform as a product to serve internal and external customers with high SLA expectations, and how we use ChatOps and GitHub bots to boost developer productivity and self-service for both engineering and non-engineering teams at Mattermost. Before we do a deep dive, we need to understand what Mattermost is, as some concepts are based on it. Mattermost is an open source platform that provides secure collaboration for technical and operational teams.
Now we'll go through the Mattermost Cloud story and how we built a SaaS platform. Back in the day, we had an idea to offer Mattermost as a SaaS product, Mattermost Cloud, and we kicked off the whole software development lifecycle. We'll go through this lifecycle and look at what was happening in each phase and the decisions we made.
We started with planning and tried to identify the current team's capacity. Do we need to hire more people? Do we have the right skill set? Is the current team structure enough for the responsibilities? What will be the cost of developing, running, and operating such a platform, and what are the milestones to launch Mattermost Cloud? At that time we had adopted the dev and ops collaboration model, with each team specializing where needed but also sharing where necessary, so there was collaboration between them.
Then we moved to the next phase, the requirements. In this phase we did a deep dive to understand the ideal customer profile (who is going to use our product), the features, the user flows, the SLOs, and the SLAs.
We realized that we need to support our SLA agreements with customers, that we need incident response, and that we need to focus on scalability and performance to host thousands of Mattermost workspaces and customers, as well as on security and compliance. At that time we realized that we needed to introduce an SRE team, with the collaboration between dev and SRE happening around operational criteria: once the SRE team is happy with the code and the production readiness reviews have been approved, they are able to support it in production.
So we introduced four new teams based on the requirements and the planning we had. SRE is responsible for building and operating a reliable, scalable, secure, and cost-efficient SaaS platform. Automation and tooling is the team that builds the automation and the tools to orchestrate and manage the fleet of customer workspaces. Delivery is responsible for CI/CD, the release lifecycle, and the release cadence for our cloud customers. And we came up with the strategy: we are going to build a SaaS platform used by customers, who will host their Mattermost workspaces on our managed infrastructure.
Then we moved to the design phase and tried to identify a couple of things: the technology stack, the core components, the system architecture, the security controls, the compliance and data requirements, and the testing. This is just a high-level view. As examples of some decisions: we were already on AWS; we decided to use a managed database, RDS Aurora; we decided to use Kubernetes to easily host thousands of Mattermost workspaces and customers; the Mattermost operator to simplify the deployment of Mattermost workspaces; the customer services and the customer portal to interact with the customers; and the provisioner and the fleet controller, which we are going to discuss in a few minutes.
So we came up with this high-level architecture. You can see here the whole SaaS platform: we have a command-and-control Kubernetes cluster and a couple of worker Kubernetes clusters, which host the customer workspaces. The flow starts with customers visiting the front end, the customer portal. The portal interacts with the customer services, which instruct the provisioner, the central point of management for Mattermost Cloud resources, to command the operator in a worker Kubernetes cluster to create a Mattermost workspace for a specific customer. We also see the fleet controller, a service responsible for tracking Mattermost workspaces and making tweaks such as hibernating or deleting inactive workspaces. So it's mostly about housekeeping.
As we discussed a few minutes ago, we also defined the testing strategy. We realized that testing the SaaS platform at scale would need lots of effort, would be really complex, and would be hard to automate. It would also be cost-inefficient, because we would run multiple workspaces with daily users and so on, and it would increase the time to market, when it was really important for us to launch Mattermost Cloud fast.
One of our core principles at Mattermost is to use our own product to complete our mission. This is what we call dogfooding. Simply put, if you use your product the way your users do, you understand exactly what it's like to use it. This is when our motto as a group was created: build once and for all. And this is also where we defined our mission as a group.
The infrastructure group's mission at Mattermost is to provide the SaaS platform as a product that helps internal and external users, by guaranteeing that we operate an enterprise-grade platform with self-serve powers. That is the mission of the group. So we refined our testing strategy. We decided that we should not over-automate: we prioritize only the customer experience and some of the critical paths for automation around integration and end-to-end testing, and we use the SaaS platform as a golden path to dogfood the platform with testing and dev envs, support envs, and demo envs. We will see how we can do that.
The new strategy was to offer the SaaS platform to two kinds of customers: the external customers we had before, and internal customers such as developers, sales, support, and DevRel. Think of the internal customers as the early adopters: they get the latest and greatest of the whole platform fast. That way we can test whatever we deliver, all the changes we make, on a platform identical to the real one, in a test environment.
So the high-level architecture changed a bit. We have the same pieces, but with a few more things. We have the customer control plane, where customers interact with the customer portal. In parallel we have the developers control plane, on the left side, where engineers interact with GitHub and Mattermost through an internal developer platform (IDP) layer. Apart from this we have the non-developers control plane, where non-engineers use Mattermost to interact with the internal developer platform. This enables all of them to use the platform in a test environment and run their own workspaces in our clusters. So let's start with our developers control plane goals.
We wanted to give developers self-service capabilities so they can run test and dev environments in the platform by themselves. We wanted to support both staff and open source contributors: we are an open source company and we respect the open source community a lot. The developer experience should happen only with the tools we use daily; we should not introduce something new that increases the cognitive load. And we wanted to make the platform and the interaction as abstract as we can.
For the non-developers control plane there are not many differences. The main difference is that we support sales, support, and DevRel. We still have the self-service capabilities, the user experience should still happen with the tools we use daily, and, through abstraction, users should be agnostic of the SaaS platform. So we wanted to provide a seamless experience.
As we see, there are some common things. We have Mattermost, where we collaborate, communicate, and talk daily, and we use it for most of our day. Both groups, developers and non-developers, use Mattermost. Developers also use GitHub, and we wanted to keep the same interfaces they had before. So we decided to use GitHub bots for GitHub and the ChatOps model for Mattermost, which we are going to discuss in a few minutes.
What are GitHub bots? GitHub bots, or apps, are used to automate and improve workflows. They are small services that react to GitHub webhook events. For that reason we created a GitHub bot called Spinwick. It observes the GitHub labels in the webhook events and deploys a Mattermost test server in Mattermost Cloud accordingly.
We have different types of test servers: ones created through the customer portal and the customer web server (CWS), and ones created using the provisioner. If we zoom in on the IDP in the architecture, there is another cluster with a couple of services running. In this case we see Spinwick, which interacts with GitHub. When something happens in GitHub, Spinwick listens for label changes, that is, labels being added or removed, and responds in GitHub to give the user context: what exactly is happening, what the status is, and so on. Spinwick has two paths, one through the customer web server and one through the provisioner. Both create a test environment, a test workspace, in our test worker Kubernetes clusters.
Let's see the self-serve capabilities of Spinwick. Here is a pull request that has been raised, and we want to test it on a cloud test server. On the right side, where the arrow is placed, you can see labels such as Setup Cloud Test Server, Setup HA Cloud Test Server, and Setup Cloud + CWS Test Server. All of these trigger the flow we discussed in the high-level architecture, where a cloud server is created.
Let's see how the whole interaction happens. Here I just added a label called Setup Cloud Test Server. For the experience, and to make it clear to the user that something is happening, Spinwick replies with a comment on GitHub saying that it is creating a new Spinwick test server using Mattermost Cloud. About a minute later you will see another comment by Spinwick saying that the Mattermost test server has been created successfully.
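The label-driven flow can be sketched as a small webhook handler. This is a minimal illustration, not Spinwick's actual code: the `Bot` class, the in-memory `servers` map, and the comment texts are hypothetical, and the label strings are written as heard in the talk. The `action`, `label.name`, and `pull_request.number` fields follow GitHub's `pull_request` webhook payload. Teardown on label removal or PR close is included because the talk describes it right after this point.

```python
from dataclasses import dataclass, field

# Label names as heard in the talk; the exact strings are an assumption.
SETUP_LABELS = {
    "Setup Cloud Test Server": "cloud",
    "Setup HA Cloud Test Server": "cloud-ha",
    "Setup Cloud + CWS Test Server": "cloud-cws",
}


@dataclass
class Bot:
    """Hypothetical stand-in for a label-watching bot (state kept in memory)."""

    servers: dict = field(default_factory=dict)   # PR number -> server kind
    comments: list = field(default_factory=list)  # (PR number, comment text)

    def handle(self, event: dict) -> None:
        """React to one GitHub `pull_request` webhook payload."""
        action = event.get("action")
        pr = event["pull_request"]["number"]
        label = event.get("label", {}).get("name")

        if action == "labeled" and label in SETUP_LABELS:
            if pr not in self.servers:
                self.servers[pr] = SETUP_LABELS[label]
                self.comments.append((pr, "Creating a new test server..."))
        elif action == "unlabeled" and label in SETUP_LABELS:
            self._destroy(pr, "label removed")
        elif action == "closed":
            # GitHub sends "closed" for both merged and unmerged PRs.
            self._destroy(pr, "pull request closed")

    def _destroy(self, pr: int, reason: str) -> None:
        # Only comment if a server actually existed for this PR.
        if self.servers.pop(pr, None) is not None:
            self.comments.append((pr, f"Test server destroyed ({reason})."))
```

A real bot would replace the in-memory map with calls to the provisioning layer and post the comments through the GitHub API; the sketch only shows how label events map to create and destroy actions.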
The comment includes the access link, whose name contains the pull request number, and the credentials, which are common for everyone, so that you can log in to your Mattermost test server. But it's not only about creation, of course; it's also about removing the environment. If I remove the label, the server is destroyed. If I merge or close the PR, the test server is destroyed as well.
And here is the good part of having a common SaaS platform that serves both kinds of customers. A few minutes ago, in the high-level architecture, we discussed the fleet controller. Let's say that someone has a pull request that has been open for many days, with a test server created for it. If we have multiple pull requests in that state, we will have multiple test servers just sitting there, not doing anything. The fleet controller is responsible for the housekeeping, the same way it is for customers: if there is no activity after a few days, the workspace goes to the hibernated state, and if there is still no activity after that, it is deleted. So the housekeeping is exactly the same as what we offer to our customers when they do trials.
This made the engineers a bit more hungry, and they wanted to automate a few more things. So we created another bot, which is used to automate workflows in GitHub, again using labels, slash commands in pull request comments, and labels for housekeeping on issues and PRs. Let's see some examples.
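The housekeeping rule described above (hibernate after a period of inactivity, delete if the inactivity continues) can be sketched as a small state transition function. The state names and both thresholds are assumptions for illustration; the talk only says "a few days".

```python
from datetime import datetime, timedelta

# Both thresholds are hypothetical; the talk does not give exact values.
HIBERNATE_AFTER = timedelta(days=3)
DELETE_AFTER = timedelta(days=7)


def next_state(state: str, last_activity: datetime, now: datetime) -> str:
    """Return the workspace state after one housekeeping pass."""
    idle = now - last_activity
    if state == "active" and idle >= HIBERNATE_AFTER:
        return "hibernating"
    if state == "hibernating" and idle >= DELETE_AFTER:
        return "deleted"
    return state
```

Because the same function would apply to customer trials and to pull request test servers, both populations get identical cleanup behavior, which is the point the talk makes about sharing one platform.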
Again, it's self-serve, and we use slash commands. When someone comes to Mattermost as an open source contributor and wants to contribute, they need to sign the Mattermost contributor agreement. In the left image you can see that someone has raised a pull request and got a comment back saying that they need to sign the agreement and, once signed, run /check-cla to confirm that the CLA check is green. You can see that the contributor signed and then ran /check-cla in a comment, and on the right side there is automatically a status check showing that the CLA check from Mattermod is green.
Mattermod has another option, /update-branch, which updates a pull request with the latest changes from the branch it targets. For example, if my pull request targets main and I run /update-branch, the latest changes from main are merged in. In the example here, someone wrote an /update-branch comment and Mattermod automatically merged all the changes from the main branch.
There are a few other commands. There is /cherry-pick, which is very handy for us for releases, when we want pull requests to be cherry-picked into multiple other branches for bug fixes or improvements. We also have slash commands to run end-to-end tests, which run on the SaaS platform, so we use the same SaaS platform for the end-to-end tests as well, and there is an end-to-end cancel command that cancels the tests if something is really slow. There's also the housekeeping part we discussed: if a pull request has had no activity for a specific duration, Mattermod automatically labels it as stale.
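The slash commands in pull request comments can be sketched as a small parser that a bot runs on each new comment body. This is an illustrative sketch only: the command set is limited to the commands named in the talk, and the argument format (for example a branch name after the cherry-pick command) is an assumption.

```python
import re

# Commands named in the talk; how arguments are passed is an assumption.
KNOWN_COMMANDS = {"check-cla", "update-branch", "cherry-pick"}
COMMAND_RE = re.compile(r"^/([a-z0-9-]+)(?:\s+(.*))?$")


def parse_comment(body: str):
    """Return (command, args) for the first known slash command, else None."""
    for line in body.splitlines():
        match = COMMAND_RE.match(line.strip())
        if match and match.group(1) in KNOWN_COMMANDS:
            args = (match.group(2) or "").split()
            return match.group(1), args
    return None
```

Requiring the command to start its own line keeps ordinary text that happens to contain a slash, such as a file path, from triggering a workflow.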
That makes it easier for us to go through these pull requests and see whether there is something we need to add or check. A few minutes ago we said that the model we chose for Mattermost was ChatOps. ChatOps is a collaboration model that connects people, tools, process, and automation into a transparent workflow. To understand what exactly we are talking about: in a collaboration platform like Mattermost without ChatOps, we communicate in chat about all the things we do together, whether we are engineers or non-engineers, but we use something else to trigger a workflow or a platform tool. With ChatOps we tried to make this seamless. Engineers and non-engineers communicate in chat, and in Mattermost they can interact with bots and send commands to initiate workflows in the platform tools. So everything happens seamlessly in the platform. This matters especially for the non-engineers: the non-developers control plane is really important, as Mattermost is the number one thing they use daily.
For ChatOps, Mattermost offers a bunch of options to extend functionality and customize the experience, such as slash commands and plugins. For our case we built the cloud plugin. The cloud bot allows the creation and management of a test env in the SaaS platform directly from Mattermost, using slash commands from any channel you are in, or from a DM to the bot.
As a small example, let's say that we are going to create a workspace for Conf42. /cloud create test-conf42 is what we need to type in Mattermost, and it replies that the installation has been initiated and that you will get a notification when it's ready. You can see the status of all your cloud installations with /cloud list. When everything is ready, the cloud bot replies with a DM saying that the workspace has been created and that the DNS is there.
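The chat side of this flow can be sketched as a command dispatcher. This is not the cloud plugin's code: the `CloudBot` class, the status string, and the reply texts are hypothetical, and a real plugin would call the provisioning layer instead of keeping state in memory. Only the `create` and `list` subcommands mentioned in the talk are modeled.

```python
class CloudBot:
    """Toy stand-in for a chat bot that fronts a provisioning API."""

    USAGE = "Usage: /cloud create <name> | /cloud list"

    def __init__(self):
        # name -> (owner, status); hypothetical in-memory store.
        self.installations = {}

    def handle(self, user: str, text: str) -> str:
        """Handle one slash command and return the bot's reply."""
        parts = text.split()
        if len(parts) < 2 or parts[0] != "/cloud":
            return self.USAGE
        if parts[1] == "create" and len(parts) == 3:
            name = parts[2]
            if name in self.installations:
                return f"Installation {name} already exists."
            self.installations[name] = (user, "creation-requested")
            return (f"Installation {name} has been initiated. "
                    "You will be notified when it is ready.")
        if parts[1] == "list" and len(parts) == 2:
            mine = [f"{name}: {status}"
                    for name, (owner, status) in sorted(self.installations.items())
                    if owner == user]
            return "\n".join(mine) or "No installations."
        return self.USAGE
```

Scoping the list to the requesting user mirrors the idea that each person manages their own test workspaces from chat without seeing the platform underneath.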
It gives you the access URL and the credentials you can use. You can also see something extra, a part we didn't discuss in this presentation: the monitoring control plane, which we offer to everyone so they can see their workspace logs and the provisioner logs in case something went wrong.
We didn't want to stop there. We also wanted to offer the ability to configure and compose your own environment, so you can do a couple more things with /cloud. Here we are creating another installation, called Cloudwick, which has a specific license, generates some test data, uses a specific size in terms of resources for high availability, and uses a specific file. This lets the support team, the sales team, or the DevRel team create different kinds of environments based on their use cases and scenarios.
If we go back to the architecture and focus on the IDP and how all the control planes work together, we still have the customer control plane, where customers interact and create their own managed Mattermost workspaces. On the left side we have the two other control planes, the developers control plane and the non-developers control plane. Both use Mattermost as an interface, and engineers also use GitHub. All of them interact with the IDP seamlessly, without knowing exactly what's happening underneath. And we can see that the IDP consists of many services: Spinwick, which we discussed before, the cloud bot, and Mattermod, plus a couple of other bots, small services doing the same kind of thing with GitHub and Mattermost.
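Composing an environment from command options can be sketched with a small argument parser. The flag names (`--license`, `--size`, `--test-data`), their defaults, and the allowed sizes are all hypothetical; the talk only says that a license, test data, and a resource size can be chosen. The `EnvSpec` type is likewise an illustration.

```python
import argparse
import shlex
from dataclasses import dataclass


@dataclass
class EnvSpec:
    """Hypothetical description of a requested test environment."""

    name: str
    license: str
    size: str
    test_data: bool


def parse_create(text: str) -> EnvSpec:
    """Parse '/cloud create <name> [options]' into an EnvSpec."""
    tokens = shlex.split(text)
    if tokens[:2] != ["/cloud", "create"]:
        raise ValueError("expected '/cloud create <name> [options]'")
    parser = argparse.ArgumentParser(prog="/cloud create")
    parser.add_argument("name")
    # All flags and defaults below are assumptions for illustration.
    parser.add_argument("--license", default="default")
    parser.add_argument("--size", default="small", choices=["small", "large"])
    parser.add_argument("--test-data", action="store_true")
    ns = parser.parse_args(tokens[2:])
    return EnvSpec(ns.name, ns.license, ns.size, ns.test_data)
```

Turning the chat command into a typed spec before calling the platform keeps validation in one place, so support, sales, and DevRel can request differently shaped environments through the same entry point.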
This is our internal developer platform story, the story of how we build once and for all, for both worlds: the customers and the internal organization.
Now the learnings. The flexibility and reusability of the same stack gave us much more confidence that what we delivered daily, and the changes we made during the whole software development lifecycle, were stable and performed well. We could identify bugs and issues fast, and even identify unhealthy Mattermost workspaces among the test envs, because we had the same SLO mentality even in the test infrastructure and the test SaaS platform. Another good learning was that the developer and user experience needs to be seamless, instead of creating something new that is outside the daily usage of the people who are going to interact with your platform. Running surveys, gathering feedback, and listening to your internal and external customers is really important; this is how the platform becomes better. And the last thing, which we already mentioned, is to use your own product. This is the only way to find out whether what you have built is meaningful for the customers: if it fits your needs and it's something you can use, customers can probably use it too. Of course this needs other inputs as well, such as feedback and SLOs. So that's it. Thanks a lot, have a great rest of the day, and have a great conference.

Spiros Economakis

Head of Cloud Infrastructure @ Mattermost



