Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hello everyone. Happy to be here. I'm here to talk about how to empower your
            
            
            
              developers to troubleshoot Kubernetes independently. My name
            
            
            
              is Itiel Shwartz and I'm the CTO and co-founder of Komodor,
            
            
            
              the first Kubernetes native troubleshooting platform. I really believe in
            
            
            
              the shift-left movement and in dev empowerment.
            
            
            
              Other than that, in the past I worked at eBay, Forter, Rookout, and
            
            
            
              I have a lot of backend and ops experience,
            
            
            
              and also I'm a Kubernetes fan. So the agenda
            
            
            
              for today is first of all we are going to talk about the benefits and
            
            
            
              challenges of Kubernetes. I know everyone knows Kubernetes,
            
            
            
              but I'm going to do a little bit more of like a deep dive on
            
            
            
              what's so good and what's not that good in Kubernetes.
            
            
            
              We are going to talk about the complexities of troubleshooting Kubernetes
            
            
            
              and how to overcome them. So first of all, I think everyone
            
            
            
              knows that move fast and break things is kind of
            
            
            
              the new motto for the past couple of years. We are
            
            
            
              shipping code a lot faster and elite teams are moving even
            
            
            
              faster than normal teams, meaning hundreds, even thousands
            
            
            
              of deployments per day. In this new
            
            
            
              world, in this new space, what happened is in order to move
            
            
            
              fast, new infrastructure has emerged, and Docker
            
            
            
              and Kubernetes are de facto the standard in this new world.
            
            
            
              And something like 88% of organizations have already adopted
            
            
            
              Kubernetes. So I know everything sounds
            
            
            
              amazing, and Kubernetes has a lot of benefits. It helps you save
            
            
            
              money, it's easy to port, it has multi-cloud capabilities,
            
            
            
              a lot of things and it sounds like it's
            
            
            
              doing everything. But at the end of the day it
            
            
            
              does come with a price. And I think every organization
            
            
            
              that is moving to Kubernetes because it looks sexy,
            
            
            
              because it will help make their lives a lot easier, is bombarded at



              the beginning of the migration with issues
            
            
            
              that arise because they moved to Kubernetes. It looks very
            
            
            
              easy, very simple to get started, but once you are in
            
            
            
              there, you feel the pain, you feel how hard it is, how complex
            
            
            
              Kubernetes really is. And also the
            
            
            
              tricky thing about Kubernetes is it makes it very easy to build very complex
            
            
            
              systems and basically it allows you to
            
            
            
              shoot yourself in the foot very easily.
            
            
            
              So the complexity of troubleshooting Kubernetes
            
            
            
              is basically because a lot of things happen in the Kubernetes cluster
            
            
            
              that you don't really know. The master can change,
            
            
            
              the node can change, and not all of those changes, all of those issues
            
            
            
              are propagated. More than
            
            
            
              that, one issue with one microservice can
            
            
            
              impact the rest of the system. And because with the move
            
            
            
              to Kubernetes, we are moving from a monolith application to a lot
            
            
            
              of microservices, a very small and innocent change in
            
            
            
              one service might crash the whole application.
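
To make that ripple effect concrete, here is a minimal sketch of how you could estimate which services are affected when one of them breaks. It assumes you keep a simple map of which service calls which; the service names and the `CALLS` map are hypothetical and not taken from any real system.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
CALLS = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def impacted_by(broken, calls):
    """Return every service that directly or transitively depends on `broken`."""
    # Invert the map so we can ask "who calls this service?"
    callers = {name: set() for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk upstream from the broken service.
    seen = set()
    queue = deque([broken])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


print(sorted(impacted_by("inventory", CALLS)))
# → ['catalog', 'checkout', 'frontend']
```

In this toy map, a change in the small `inventory` service reaches every service above it, which is the kind of surprise that is hard to see from inside a single microservice.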
            
            
            
              Other than that, there are permission issues and some organizations
            
            
            
              don't really trust everyone to see all of the data.
            
            
            
              That basically causes a lack of ability



              to act independently for some team members, and particularly
            
            
            
              for developers. And because Kubernetes is still new and not
            
            
            
              a lot of people are really experts in Kubernetes,
            
            
            
              we see how the lack of knowledge is frustrating
            
            
            
              for developers and for teams, and they
            
            
            
              don't really understand how they should operate this
            
            
            
              very complex system and how they are expected
            
            
            
              to own the services even though they don't really know what is happening
            
            
            
              under the hood. More than that, there is the
            
            
            
              very big question of who is responsible for troubleshooting issues
            
            
            
              in Kubernetes. I think if we
            
            
            
              had asked the same question a couple of years ago, it was clear
            
            
            
              that it was the NOC's and ops' responsibility
            
            
            
              to manage things in production. Now, with the shift-left
            
            
            
              movement in a world where for each dev there
            
            
            
              is one DevOps, we see more and more organizations trying to
            
            
            
              shift this responsibility from the DevOps to
            
            
            
              the developer, mainly because there are a lot of developers and they are already
            
            
            
              responsible for deploying the code into production.
            
            
            
              And if they are already deploying the code into production, then it
            
            
            
              only makes sense that they will be responsible for
            
            
            
              the troubleshooting, for the ability not only to
            
            
            
              break things, but also to fix them and to understand
            
            
            
              what broke and why it broke.
            
            
            
              So the question is, if Kubernetes is indeed complex,
            
            
            
              and if we want our developers to help us troubleshoot
            
            
            
              Kubernetes, if we want them to take full ownership
            
            
            
              over their application, full end-to-end, from deployment
            
            
            
              to troubleshooting, what do we need to do as operations, as
            
            
            
              SREs, in order to allow them, in order to
            
            
            
              make them part of the troubleshooting cycle, and not only
            
            
            
              bystanders? So step one,
            
            
            
              and like a lot of organizations,
            
            
            
              when we speak with organizations, we see how they try to throw everything
            
            
            
              at the developers, basically: "I don't know, you are now



              responsible for it. We brought Kubernetes into the organization so



              the devs can have more empowerment. It's your problem now."
            
            
            
              But I think one of the key things that you must start with
            
            
            
              is to get buy-in from the devs. You need the devs, and you
            
            
            
              don't necessarily have to get all of the dev organization. Even a few
            
            
            
              champions at first are good enough. You need people that
            
            
            
              will want to troubleshoot and they want to troubleshoot not
            
            
            
              because they're masochists, but because
            
            
            
              it will give them more ownership, more responsibilities and
            
            
            
              more independence. So you need to understand how
            
            
            
              to bring the developers along on the journey of troubleshooting,
            
            
            
              how you can make them part of the team and not
            
            
            
              something that they are reluctant to
            
            
            
              do. If you expect them to wake up in the middle of the night and
            
            
            
              to understand what is happening in the Kubernetes cluster, you have
            
            
            
              to get the buy-in, you have to get them engaged,
            
            
            
              and if not, it's not really going to work. After you



              get them engaged, and now everyone really wants to troubleshoot,



              you need to spend the time on the training part. A lot
            
            
            
              of developers are measured,
            
            
            
              or used to be measured on the quality of the code they
            
            
            
              write and the number of features they write, not necessarily on
            
            
            
              how good they are at troubleshooting. They have a very micro
            
            
            
              and narrow scope. I don't like to say every developer
            
            
            
              is like this, but as a whole, while the operations people that
            
            
            
              troubleshoot have more of a macro-level view. And what you need to do
            
            
            
              is help them to understand the world as you see it,
            
            
            
              to write the playbook, to go in and train
            
            
            
              them so they will have all of the necessary knowledge
            
            
            
              in order to troubleshoot independently. You can't expect to
            
            
            
              say, okay, you're a developer, I see that you want to troubleshoot, knock yourself
            
            
            
              out. You should help them make the
            
            
            
              journey a lot easier by giving them some of the knowledge
            
            
            
              that you already have about both the technical parts of Kubernetes
            
            
            
              and also about the company-specific parts of how Kubernetes



              is managed in my organization.



              Step three is to give them the tools they need to succeed.
            
            
            
              Again, I will say that most monitoring
            
            
            
              tools, most troubleshooting tools today are more
            
            
            
              focused on the operations people. Komodor is an exception,
            
            
            
              but I will say most of them are macro-level,
            
            
            
              operations-focused. You need to help them have
            
            
            
              the right tools in order to troubleshoot efficiently. And that
            
            
            
              means creating the relevant dashboards for them in Datadog, explaining to



              them how they can add or change things in Datadog or Prometheus,
            
            
            
              it doesn't really matter. You have to make sure that once an issue
            
            
            
              occurs in the system, they have the relevant places
            
            
            
              and tools to look at. They should have the
            
            
            
              required permissions to go in, but it



              should also be part of the training to explain to them how



              to use those tools. And once
            
            
            
              we have all of these tools, we must give them the permissions. You can't



              really expect someone to be responsible
            
            
            
              for the production environment when he has zero production



              access, when he can't see the logs or can't do actions such
            
            
            
              as rolling back or increasing replicas and so on. So if you
            
            
            
              expect them to wake up in the middle of the night to troubleshoot the issue,
            
            
            
              you must give them not only the necessary tools, but more than that, the permissions
            
            
            
              and the abilities to do actions
            
            
            
              in order to remediate the issue. One of the most frustrating things that
            
            
            
              I see in organizations is that the developers can
            
            
            
              deploy the code to production, but for some reason they can't really roll back
            
            
            
              to the previous version. And that in turn just
            
            
            
              creates frustration for the developers and for the DevOps.
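
As a rough illustration of what "give them the permissions" could look like, here is a minimal sketch that builds a namespaced Kubernetes RBAC `Role` letting developers read pods, logs and events, and patch deployments so they can roll back or change the replica count. The role name `dev-troubleshooter`, the namespace `team-a`, and the exact list of resources and verbs are assumptions for this example; your cluster and tooling may need a different set.

```python
import json


def developer_role(namespace, name="dev-troubleshooter"):
    """Build a namespaced RBAC Role manifest as a plain dict.

    The name and the rule set are illustrative assumptions, not a
    recommended or complete policy.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Read-only visibility: pods, their logs, and events.
            {
                "apiGroups": [""],
                "resources": ["pods", "pods/log", "events"],
                "verbs": ["get", "list", "watch"],
            },
            # Read deployments and their replica sets (rollout history).
            {
                "apiGroups": ["apps"],
                "resources": ["deployments", "replicasets"],
                "verbs": ["get", "list", "watch"],
            },
            # Remediation: allow changing a deployment (rollback) and its scale.
            {
                "apiGroups": ["apps"],
                "resources": ["deployments", "deployments/scale"],
                "verbs": ["patch", "update"],
            },
        ],
    }


# kubectl accepts JSON manifests, so the output can be saved and applied.
print(json.dumps(developer_role("team-a"), indent=2))
```

A `Role` on its own grants nothing until a `RoleBinding` attaches it to a user or group, so a real setup would need that second object as well.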
            
            
            
              No one is happy, because the DevOps
            
            
            
              is required to be in the loop every time we need to
            
            
            
              roll back. And on the other hand, the developers need to call up
            
            
            
              the DevOps just so you can click the button. So free yourself,
            
            
            
              remove the bottlenecks, and allow the developers to take
            
            
            
              full responsibility over the lifecycle of
            
            
            
              the system. And I will say
            
            
            
              that while we are talking about troubleshooting, about the



              procedures, about the training, about the tools, it is important
            
            
            
              to make sure you are a single team,
            
            
            
              whose core mission is to make the system better and
            
            
            
              to improve the troubleshooting process and time. So you need
            
            
            
              to find the relevant partners. It can be even one or two tech
            
            
            
              leads in the dev organization to get started. You need
            
            
            
              to remember this is a marathon. It's not a sprint. It's not
            
            
            
              like there's one silver bullet and all of your developers
            
            
            
              will be just as good as you are. You need to remember, it will take
            
            
            
              time. You will need to change some of the playbooks, you will need
            
            
            
              to change some of the tools, you will need to adapt them. You will maybe
            
            
            
              need to create new actions, a new access mechanism to
            
            
            
              accommodate the fact that the developers are now owning the troubleshooting
            
            
            
              process. But I don't think you should really



              take it as a bummer, like, "Now I need to write tools



              for the developers to use," or something like that. Just the opposite.
            
            
            
              You are now writing tools that will empower your developers, and
            
            
            
              in this way you are basically freeing yourself out of
            
            
            
              troubleshooting and out of waking up in the middle of the night.
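
One small example of such a tool, as a sketch only: a helper that builds the `kubectl` commands for the two remediation actions mentioned earlier, rolling back and changing replicas, but only for namespaces a team is allowed to touch. The `ALLOWED_NAMESPACES` set and the function names are hypothetical. The code only builds the commands and does not run them.

```python
import shlex

# Hypothetical guard rail: the namespaces this team may act on.
ALLOWED_NAMESPACES = {"team-a", "team-b"}


def _check_namespace(namespace):
    if namespace not in ALLOWED_NAMESPACES:
        raise PermissionError(f"namespace {namespace!r} is not allowed")


def rollback_command(deployment, namespace):
    """Build the kubectl command that rolls a deployment back one revision."""
    _check_namespace(namespace)
    return ["kubectl", "rollout", "undo", f"deployment/{deployment}",
            "--namespace", namespace]


def scale_command(deployment, namespace, replicas):
    """Build the kubectl command that sets a deployment's replica count."""
    _check_namespace(namespace)
    if replicas < 0:
        raise ValueError("replicas must be zero or more")
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "--namespace", namespace]


print(shlex.join(rollback_command("checkout", "team-a")))
# → kubectl rollout undo deployment/checkout --namespace team-a
```

Returning the command as a list keeps the sketch easy to test and leaves it to the caller to decide whether to print it, log it, or pass it to a process runner.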
            
            
            
              I think one thing that good
            
            
            
              operations people and developers love to do is to automate themselves
            
            
            
              out of the process. And this is the state of mind that you
            
            
            
              need: how can I be less involved in troubleshooting?
            
            
            
              How can I give more tools, more capabilities to
            
            
            
              the developers? So to conclude
            
            
            
              the talk today, I will say that it's important when starting to
            
            
            
              adopt Kubernetes to think about all of these things.
            
            
            
              You can't really expect to move everything from the
            
            
            
              legacy system into Kubernetes and expect things to work normally.
            
            
            
              You should try and build the right foundation from the ground
            
            
            
              up, meaning have the devs involved in picking the technologies,
            
            
            
              in troubleshooting from day one, and getting the buy-in
            
            
            
              from them. More than that, I will say that
            
            
            
              at the end of the day this will increase the velocity of
            
            
            
              everyone in the team, both the developers, because they
            
            
            
              will be able to write code faster, and once they have an issue they will
            
            
            
              be able to solve it by themselves. And it will also help
            
            
            
              to free the DevOps and the SREs, because they
            
            
            
              won't be bombarded with developers asking, "Why is



              it not working? Can you fix my application? My application is crashing,"
            
            
            
              and so on. So both the developers and operations
            
            
            
              should be happy about this process as it will make
            
            
            
              both of them much more efficient and it will save time for
            
            
            
              everyone. So I think having the developer as a
            
            
            
              critical part of the troubleshooting process is a must for
            
            
            
              every organization that wants to move fast. And I think
            
            
            
              that it's not that hard, but it is a process,
            
            
            
              it will take time, and you need to understand that even though



              it might be a bit hard at first, over the
            
            
            
              long term it's worth your time.
            
            
            
              And that was me talking about troubleshooting
            
            
            
              Kubernetes. It was really fun being here. Thanks a lot,
            
            
            
              everyone.