Conf42 Chaos Engineering 2023 - Online

Building a more robust Apache APISIX Ingress controller with Litmus Chaos

Abstract

Both Litmus and APISIX Ingress are very active open-source projects. Combining these two projects, we have completed many exciting Chaos experiments and built a more robust project.

Summary

  • Chaos engineering is the process of evaluating a software system by simulating destructive events. It can also help teams simulate real-world scenarios in a secure, controlled environment to uncover hidden risks and identify performance bottlenecks in distributed systems. This approach is an effective way to prevent system downtime or production interruptions.
  • Ingress is a resource object in Kubernetes. It contains rules for how clients outside the cluster can access the services inside the cluster. The Ingress controller translates the Ingress rules into configuration on a proxy. Different Ingress controllers have different implementations.
  • Litmus Chaos is an open source chaos engineering framework and an incubating project of the CNCF. It provides an infrastructure experimentation framework to validate the stability of controllers and microservice architectures. Kubernetes developers and SREs use Litmus to manage chaos in a declarative manner.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today I would like to discuss how to build a more robust Apache APISIX Ingress controller with Litmus Chaos. Let me introduce myself first. I'm Jintao, an Apache APISIX PMC member, a maintainer of the Kubernetes Ingress NGINX project, and a Microsoft MVP. If you would like to get in touch with me, you can find my GitHub profile and email address on the slides.

In this talk, the agenda is to discuss why we need chaos engineering, how to design chaos experiments for an Ingress controller, how to practice it, the benefits, and the future of this field.

First, why do we need chaos engineering? Let's review the definition. Chaos engineering is the process of evaluating a software system by simulating destructive events such as server or network outages or API throttling. In this process, we test the system's resilience and reliability under unstable and unexpected conditions by introducing chaos, for example server faults. Chaos engineering can also help teams simulate real-world scenarios in a secure, controlled environment to uncover hidden risks and identify performance bottlenecks in distributed systems. This approach is an effective way to prevent system downtime or production interruptions. Netflix's approach to handling systems inspired us to take a more scientific approach, which drove the birth and development of chaos engineering.

What is chaos engineering? I think the first part is the introduction of disruptive events. Chaos engineering involves introducing disruptive events such as network partitions, service degradation, and resource constraints to simulate real-world scenarios and test the system's ability to handle unexpected conditions. The purpose of this is to identify weaknesses and use the information to improve the system's design and architecture, making it more robust and resilient. Then, testing the system's resilience.
Today's technical landscape is constantly evolving and fast-paced. To ensure that systems are robust, scalable, and able to handle unexpected challenges and conditions, it's very important to test the system's resilience in the real world. Chaos engineering is an effective way to do this. It involves introducing disruptive events to observe the system's response and measure its ability to handle unexpected conditions. To measure the impact of disruptive events on the system's resilience, organizations can monitor system logs, performance metrics, and user experience. By tracking these metrics, organizations can gain a better understanding of the system's behavior and identify areas for improvement.

The next one is discovering hidden problems. Distributed systems can be prone to hidden issues such as data loss, performance bottlenecks, and communication errors. These problems can be hard to detect, as they often only become visible when the system is under pressure. Chaos engineering can help uncover these hidden issues by introducing disruptive events. This information can then be used to improve the system's design and architecture, making it more reliable. By identifying and resolving these problems, organizations can enhance the reliability and performance of the system. This can help prevent downtime, reduce the risk of data loss, and ensure the system continues to run smoothly.

That is what chaos engineering is. Why do we need it? First, distributed systems are complex, with a lot of inherent chaos in the system. The use of cloud and microservice architectures provides us with many advantages, but it also comes with complexity and chaos, which can lead to failures. The engineer's responsibility is to make the system as reliable as possible. Without testing, we have no confidence to let our product be used in production environments. To make it more robust, in addition to conventional unit tests, we decided to introduce chaos tests.
When an error occurs, repairing it takes time and can cause immeasurable loss, which may have long-term effects. In the process of repair, we need to consider various factors, including the complexity of the system, the type of error, and possible new problems, in order to ensure that the final repair is effective. Moreover, when an open source project brings serious faults to users in production, many users will choose to switch to another product.

Back to today's topic: how to design chaos experiments for an Ingress controller. Let's talk about what Ingress is first. Ingress is a resource object in Kubernetes. It contains rules for how clients outside the cluster can access services inside the cluster. These rules include which clients can access which services, how to route client requests to the services, and how to handle those client requests. On the right is a simple example. As you can see, Ingress is a very simple resource; there is no need to make it more complicated than it needs to be.

Then, what is an Ingress controller? An Ingress resource requires an Ingress controller to process it; otherwise it has no practical use. The Ingress controller translates the Ingress rules into configuration on a proxy, allowing external clients to access services within the cluster. An Ingress controller is a specific type of load balancer that receives Ingress rules from the cluster and then translates them into configuration that can proxy client requests. This effectively manages how external clients access services within the cluster.

However, in a production environment, we need more complex capabilities, such as limiting access sources and request methods, authentication, and authorization. The Ingress resource object doesn't include this part, so most Ingress controllers extend the semantics of Ingress through annotations on the Ingress resource. Different Ingress controllers have different implementations.
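
The shape of an Ingress resource can be sketched in Python as a plain dictionary. This is a minimal illustration and not the example from the slides: the function name `build_ingress`, the host, the service name, the `apisix` ingress class name, and the `example.com/hypothetical-option` annotation key are all assumptions made for the sketch. The field layout follows the `networking.k8s.io/v1` Ingress API.

```python
import json


def build_ingress(name, host, service_name, service_port, annotations=None):
    """Return a minimal networking.k8s.io/v1 Ingress manifest as a dict."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            # Controller-specific behavior is usually attached here.
            "annotations": dict(annotations or {}),
        },
        "spec": {
            # Assumption: the cluster has an IngressClass named "apisix".
            "ingressClassName": "apisix",
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service_name,
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_ingress(
        "demo",
        "demo.example.com",
        "demo-svc",
        80,
        # Hypothetical annotation key, used only to show the extension point.
        annotations={"example.com/hypothetical-option": "true"},
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be piped to `kubectl apply -f -`, since kubectl accepts JSON as well as YAML. The `annotations` map is the part that each controller interprets in its own way.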
For example, the annotations used by Kubernetes Ingress NGINX and Apache APISIX Ingress are different.

Okay, what is Apache APISIX Ingress? The Apache APISIX Ingress controller is a controller for Kubernetes Ingress resources that helps administrators manage and control ingress traffic. It uses Apache APISIX as the data plane to provide users with dynamic routing, load balancing, security policies, and other features to improve network control and ensure high availability and security for their business.

APISIX Ingress supports three configuration methods: you can use Kubernetes Ingress, custom resources, or Gateway API. Each of these has its own advantages. For example, the Ingress resource is simple to describe and is a resource that Kubernetes carries by default; it's also easy to integrate with other components. Next is Gateway API. Gateway API is the next generation of Ingress and provides richer semantics and functions. The last one is CRDs. Apache APISIX Ingress provides a set of custom resources that correspond to APISIX's own resources, which are convenient for users to use and understand.

APISIX Ingress adopts a separated architecture, with the control plane handling routing rules without carrying business traffic. All client requests are processed through the data plane; therefore, any abnormality in the control plane will not affect the traffic. In addition, the APISIX Ingress controller has a retry module: after the control plane component is restored, the routing rules can be synced to the data plane. APISIX Ingress also supports integration with external service discovery components.

Then, what is Litmus Chaos? Litmus Chaos is an open source chaos engineering framework and an incubating project of the CNCF. It provides an infrastructure experimentation framework to validate the stability of controllers and microservice architectures.
It can simulate container-level and application-level environments, as well as natural changes, forced failures, and upgrades, to understand how the system responds to these changes. The framework can also explore the behavior changes between controllers and applications, and how controllers respond to challenges in specific states. In addition, Litmus Chaos offers convenient observability capabilities. It is highly extensible and can be integrated with other tools to enable the creation of custom experiments. Kubernetes developers and SREs use Litmus to manage chaos in a declarative manner and identify weaknesses in their applications and infrastructure.

Someone asked me why I chose Litmus Chaos over other products. That is a topic for another time, but to summarize, Litmus Chaos has the functions I need, and I'm more familiar with it.

Okay, how to design a chaos experiment? This is a general procedure applicable to the design of chaos experiments in any scenario.

First, define the system under test. Identify the specific components of the system you want to experiment on and develop a clear and measurable objective for the experiment. This includes creating a comprehensive list of the components, such as hardware and software, that will be tested, as well as defining the scope of the experiment and the expected outcomes.

Next, choose the right experiment. Select an experiment that is aligned with the objective you have set and closely mimics a real-world scenario. This will help ensure that the experiment produces meaningful results and accurately reflects the behavior of the system.

Next, establish a hypothesis. Establish a hypothesis about how the system will behave during the experiment and what outcome you expect. This should be based on past experience or research, and it should be reasonable and testable.
Next, run the experiment. Run the experiment in a controlled environment, such as a staging environment, to limit the potential for harm to the production system. Collect all relevant data during the experiment and store it securely. There may be different opinions on whether the experiment should take place directly in the production environment. However, for most scenarios we need to ensure that the service level objective of the system is met.

The last one is to evaluate the results. Evaluate the results of the experiment and compare them to your hypothesis. Analyze the data collected and document any observations or findings. This includes identifying any unexpected results or discrepancies and determining how they might affect the system. Additionally, consider how the results of the experiment can be used to improve the system.

Okay, let's see the main usage scenarios of the Ingress controller. Proxying traffic is the most important capability, so I wrote it three times. The other functions are all based on this core function. Consequently, when conducting chaos engineering, proxying traffic normally is the key metric.

Next, we can use the general model above to define the system under test. For APISIX Ingress, users need to create route configurations such as Ingress, Gateway API, or CRD resources and apply them to the Kubernetes cluster. This process goes through the Kubernetes API server for authentication and authorization, and the configuration is then stored in etcd. The APISIX Ingress controller continually watches for changes in the Kubernetes resources. These configurations are then translated into configuration on the data plane. When a client sends a request to the data plane, it accesses the upstream service according to the routing rules.

It is clear that if the Kubernetes API server has an exception, it will prevent the configuration from being created or the Ingress controller from getting the correct configuration.
This is an obvious and certain scenario, so no experiment is needed. If there is an exception in the data plane, such as a network interruption, a crash, or a pod being killed, it will also not be able to proxy traffic normally. This doesn't need an experiment either. Therefore, the scope of our experiment is mainly the impact on the system if the Ingress controller has an exception.

Next, we should choose the right experiment. Based on the above reasoning, we can directly cover many scenarios of incorrect configuration through end-to-end tests, and mainly use chaos engineering to verify whether the data plane can still proxy traffic normally when the Ingress controller encounters an exception such as a DNS error, a network interruption, or a pod being killed.

Then, establish the hypothesis. For each scenario, we can create the following hypothesis: when the Ingress controller encounters an exception, client requests can still get a normal response. This is our hypothesis.

Next, we should run the experiment. The experiment and the variables have been determined, so all that is left is to conduct the experiment. Litmus Chaos provides various ways to conduct experiments. We can do this through the Litmus portal. To do this, we need to create chaos scenarios and select the application to be experimented on; these steps are relatively straightforward.

However, we must pay attention to the fact that Litmus Chaos includes a probe resource. Probes are pluggable checks that can be defined within the ChaosEngine for any chaos experiment. The experiment pods execute these checks based on the mode they are defined in, and factor their success as a necessary condition in determining the verdict of the experiment, in addition to the standard built-in checks. At the same time, we can also schedule experiments, which is a very valuable function.
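
The hypothesis and the probe idea above can be sketched in a few lines of Python. This is not Litmus code and does not call any Litmus API; it only illustrates the logic of a check that runs repeatedly while chaos is injected and whose success is a necessary condition for the verdict. The names `run_continuous_probe` and `verdict`, the `-1` error marker, and the `"Pass"`/`"Fail"` strings are assumptions made for the sketch.

```python
import time
from typing import Callable, List


def run_continuous_probe(
    send_request: Callable[[], int],
    attempts: int,
    interval_s: float = 0.0,
    expected_status: int = 200,
) -> List[int]:
    """Call send_request repeatedly and collect every unexpected status code."""
    failures: List[int] = []
    for i in range(attempts):
        try:
            status = send_request()
        except Exception:
            # A connection error counts as a failed check; -1 is an arbitrary marker.
            status = -1
        if status != expected_status:
            failures.append(status)
        if interval_s > 0 and i < attempts - 1:
            time.sleep(interval_s)
    return failures


def verdict(failures: List[int]) -> str:
    """The hypothesis holds only if no check failed during the chaos window."""
    return "Pass" if not failures else "Fail"


if __name__ == "__main__":
    # Stand-in for a real HTTP call to the data plane; it always answers 200 here.
    def fake_data_plane() -> int:
        return 200

    failures = run_continuous_probe(fake_data_plane, attempts=5)
    print(verdict(failures), failures)
```

In a real experiment, `send_request` would issue an HTTP request to a route served by the data plane while the Ingress controller pod is being disrupted, and a single unexpected status would be enough to reject the hypothesis.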
Litmus also supports running experiments by submitting YAML manifests.

Then, how to evaluate the results? Litmus has built-in statistical reports that clearly show the results of the experiments. There are also other rich reports, such as comparisons of experiments and execution records. It can also be integrated with Prometheus and Grafana to provide a unified dashboard. However, for my current experiment scenarios, I only used the built-in reports.

Then, the benefits and the future. Apache APISIX is an open source project that is used in various companies and environments. Chaos engineering has given us confidence that the delivered APISIX Ingress is stable and reliable. Thanks to our complete end-to-end tests, we no longer need to worry about unexpected behavior due to the introduction of new PRs. Chaos engineering has also helped us identify a bug: when the APISIX Ingress controller pod was killed multiple times in succession, it could cause a configuration failure. Fortunately, this problem has been fixed.

I'm now continuing chaos tests in a private deployment environment. I plan to introduce chaos experiments based on Litmus into the CI environment of the Apache APISIX Ingress project, and I want to provide reference documents and examples for other open source users to implement chaos engineering for APISIX Ingress in their own environments.

That's all. Thank you. I'm honored to be here to share some of my experience with you. If you are interested, feel free to contact me anytime. See you.
...

Jintao Zhang

Technical Expert @ API7.AI



