Conf42 Chaos Engineering 2021 - Online

Forecasting based proactive optimization of cloud resources

Abstract

A novel concept of advanced adaptation of cloud resources using predicted demand for cloud resources. The presented approach uses advanced forecasting methods combined with machine-learning-based solvers, which can dynamically adapt to a changing workload. The benefits of this approach will be shown.

The presentation gives practical examples of a novel approach to proactive optimization of cloud resources based on dynamic and anticipated resource use. The predicted application workload is provided as input to advanced, machine-learning-based solvers, which calculate the optimal deployment plan for the application in anticipation of future needs. State-of-the-art methods such as ES-Hybrid are used for forecasting, and advanced Monte Carlo Tree Search based solvers are used to find the optimal solution.
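As a rough illustration of the forecasting idea, here is a minimal sketch. Simple exponential smoothing stands in for the ES-Hybrid model mentioned above; the series, the smoothing factor and the function name are purely illustrative, not part of the actual platform.

```python
# Minimal sketch of demand forecasting feeding the optimizer. Simple exponential
# smoothing stands in for the ES-Hybrid model; all numbers are illustrative.

def forecast_workload(history, alpha=0.3, horizon=3):
    """Return `horizon` future workload estimates from past observations."""
    level = history[0]
    for observation in history[1:]:
        level = alpha * observation + (1 - alpha) * level   # update smoothed level
    return [level] * horizon                                # flat forecast

# The predicted demand (e.g. requests per minute) would be passed to the solver
# instead of only the last observed value.
observed = [120, 135, 150, 170, 190, 220]
print(forecast_workload(observed))
```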

Summary

  • Melodic is a single universal platform for two things: automatic deployment of applications to the multi-cloud, and optimization of resource usage. We will show some use cases of how it can be used for AI-based applications.
  • Melodic is the simplest and easiest way to use the multi-cloud approach. It is probably the only unified way to deploy virtual machines, containers, serverless functions and big data frameworks automatically to different cloud providers. Melodic monitors the application and reconfigures it to optimize the deployment.
  • The first user of Melodic is AI Investments, which works on time-series forecasting and portfolio optimization. Melodic is a very good fit for scaling machine learning training. It increases the reliability and availability of the application. For AI Investments, the savings are quite significant.
  • How to automatically deploy your own application with the Melodic platform: we will perform the deployment of a Spark-based application, monitor the application metrics and observe the reconfiguration process. It is also possible to run the application in simulation mode.
  • Genome is a big data application which performs calculations and saves the results in AWS. Genome's performance is managed by Spark. Melodic creates the proper number of Spark workers as virtual machines, and Spark divides the calculations, called tasks, between the available workers to optimize application performance and cost.
  • The next step in the deployment process is deploying, where Melodic performs operations based on the calculated solution. The solution is deployed for each application component. After the Spark application is successfully deployed by Melodic, we can go to Grafana.
  • Melodic is fully open source, released under the Mozilla Public License 2.0. Follow us on LinkedIn, Twitter and Facebook, and please visit the Melodic site, www.melodic.cloud. If you have any questions, please do not hesitate to contact us.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, welcome to our session. My name is Paweł Skrzypek, I'm the technical director of the Melodic project, and today, together with Alicja from the Melodic project, who is a technical lead in this project, we will tell you about our platform, our project, and probably most importantly, we will tell you how to deploy an application to the multi-cloud and how to do it in an optimized way, fully automatically, to different cloud providers, using optimized resources to save cost and to maximize the performance of the application. Let's start from the very beginning. What is Melodic? Melodic is a single universal platform for two things. One is the deployment, the automatic deployment of the application to the multi-cloud, so to different cloud providers. And the second one is the optimization of resource usage. So Melodic automatically optimizes which resources are used, how to save cost, maximize performance and other aspects of the application. Melodic is a fully open source project, so it can be downloaded and used. It has been created within the Horizon 2020 Melodic project and is further developed by the 7bulls company. Today we will tell you what Melodic is. We will show some use cases, so how it can be used for AI-based applications. Here we have the example of the AI Investments company, which is using Melodic to optimize the training of their machine learning models. And we will also show the example of the big data application, Genome, which is used in research related to genome processing and genome mutation identification. So I hope that you will find this session interesting. The first question is why we should use a platform like Melodic to deploy applications. The first and probably most important reason is that Melodic is the simplest and easiest way to use the multi-cloud approach. So instead of deploying the application manually, choosing or selecting manually the right cloud to which we want to deploy the application, it is done automatically by Melodic based on an already predefined model. The second thing is that it's probably the only unified way to deploy virtual machines, containers, serverless functions and big data frameworks automatically to different cloud providers. So Melodic supports all of these components and even more: we're currently working on support for GPU and FPGA accelerators, to allow deploying machine learning workloads to accelerated resources. As I said, it is deployed automatically to the cloud providers listed below. We have integrated them, and the deployment is done fully automatically. And even more, Melodic optimizes the usage of cloud computing resources. So Melodic fetches all of the offers from the cloud providers, with pricing and technical parameters, and selects the best set of resources to deploy the application. And even more, it automatically monitors the application and reconfigures the application to optimize that deployment. It's a very unique and probably the only platform which supports all of this. The first step is to model the application, and we are using CAMEL, the Cloud Application Modelling and Execution Language. It is a cloud-agnostic language, unlike CloudFormation or Heat; it's quite similar to TOSCA, but also contains a superset of additional features. Compared to TOSCA, we can model the application, so not only the infrastructure, but also the application components, connections and security.
We can of course model infrastructure as well. And very importantly, we can define requirements, constraints and a utility function value for our application. Thanks to that, we are able to transform this model into a mathematical form, like a constraint programming model. And this constraint programming model is the basis for optimizing resources, and that's a very unique element of the Melodic platform. The second unique element is the way of determining what the best deployment is. Melodic collects metrics, different measurements like CPU usage, memory usage and other technical metrics, but also business metrics like the average response time to the customer, the average handling time for the customer, the average time for processing a given job, and so on. And additional metrics can be added very easily, especially metrics specific to the application. Based on the collected metrics and the current deployment, we can define a so-called utility function, which defines what the best deployment is. Usually this deployment is a trade-off between cost, performance, availability, security and other elements. Because if we do not have this trade-off, then if we want only to minimize the cost, we just do not need to deploy the application, because the cost will be zero. If we want to maximize performance without minimizing the cost, then we can deploy the application on the biggest virtual machines. So usually we have this trade-off between cost and performance, availability and so on. And Melodic, based on the defined utility function, is able to find the optimal solution for that. This is a very unique feature of Melodic; it is not supported, at least to my knowledge, by other platforms. How does Melodic work? The first step is to prepare the application, the model of the application, and set the parameters, the initial values of the requirements and constraints. It is done only once. Of course the parameters can be adjusted, but they are also adjusted by Melodic during the optimization process, so it is usually done once. After that, Melodic calculates the initial deployment based on the predefined parameters and deploys the application to the selected cloud providers. After the deployment, the metrics start to be collected and Melodic verifies whether certain thresholds are exceeded. So for example, if the response time is too long, or the typical processing time is too long, or any other metric is exceeded, we call that an SLO violation. In case of such an SLO violation, a new deployment, a new solution is calculated: a new optimization is started with the new values of the metrics and parameters. Thanks to that, Melodic is continuously optimizing the application and adjusting it to the current workload. So it doesn't need to be done manually or through a predefined set of rules; it is done completely automatically. Melodic uses very advanced optimization algorithms. We are using constraint programming optimization, we are using genetic optimization, and we are using Monte Carlo Tree Search with a neural network, in a similar way as AlphaGo, for solving the optimization problem. So the results of proactive optimization are really good.
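Before moving on to the architecture, here is a minimal, hypothetical sketch of the utility-function idea described above: a weighted trade-off between cost and performance that the solver tries to maximize. The weights, scaling and names are illustrative, not Melodic's actual utility definition.

```python
# Hypothetical utility function: a weighted trade-off between hourly cost and
# average response time. Higher utility is better; weights and bounds are made up.

def utility(hourly_cost, avg_response_ms,
            cost_weight=0.5, perf_weight=0.5,
            max_cost=10.0, worst_response_ms=800.0):
    cost_score = max(0.0, 1.0 - hourly_cost / max_cost)               # cheaper is better
    perf_score = max(0.0, 1.0 - avg_response_ms / worst_response_ms)  # faster is better
    return cost_weight * cost_score + perf_weight * perf_score

# Two candidate deployments: small and slow vs. large and fast.
print(utility(hourly_cost=1.0, avg_response_ms=600.0))   # cheap but slow
print(utility(hourly_cost=6.0, avg_response_ms=150.0))   # expensive but fast
```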
Here we have the overall architecture of Melodic. I will maybe not go too deeply into that, but a few key elements are that we are using a microservice architecture: each of the key modules is a separate microservice. We are using the Mule enterprise service bus (ESB), Community Edition, as the control plane. The logic and business flow is orchestrated through BPMN; we are using Camunda. And the monitoring plane, used for sending metrics, is based on the ActiveMQ broker. That's Melodic, but we are of course still working on it, and a new version is being developed within the Morphemic project. The first key novelty of that project will be the polymorphic architecture. So the application will not only be adapted by selecting resources, by selecting to which cloud providers we should deploy and which type of virtual machines we should choose, but we will also be able to adapt the architecture. So we would be able to change the architecture of the application: we could decide that instead of using virtual machines or containers, we can use serverless components, and we can also use accelerated hardware like FPGAs or GPUs. The second novelty will be proactive adaptation. Currently, Melodic adapts to the current workload, to the current values of the metrics. But we want to make a step further into the future. We want to adapt to the expected workload. We want to try to forecast the execution context and to be able to make a reconfiguration anticipating, for example, an increase of the workload, an increase in the number of customers. Currently, if the number of customers is rising, we start the reconfiguration and it takes some time, so there is a short period when the application is not ready for this increased number of customers. But in Morphemic we want to predict that and to add that feature to the Melodic platform, so we will be able to predict future workloads, anticipate them in the deployment and prepare the application for that. So proactive adaptation will work starting from the initial deployment. Then the metrics will be collected, as they are currently collected in Melodic. But in the third step, we will forecast the future values of the metrics; the optimization of the resources will be based on the forecasted values of the metrics, and the optimal deployment plan will be determined based on those forecasted values. And then the application will be reconfigured in anticipation of that workload.
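As a rough illustration of the difference between reactive and proactive triggering described above, consider the sketch below. The naive linear extrapolation only stands in for the real forecasting models developed in Morphemic, and the metric, threshold and function names are made up.

```python
# Reactive vs. proactive reconfiguration trigger. The forecaster is a naive
# linear extrapolation used only for illustration.

SLO_RESPONSE_MS = 500.0   # hypothetical threshold from the application model

def naive_forecast(history, steps_ahead=3):
    """Extrapolate the last trend a few measurement intervals into the future."""
    trend = history[-1] - history[-2]
    return history[-1] + steps_ahead * trend

def should_reconfigure(history, proactive=True):
    value = naive_forecast(history) if proactive else history[-1]
    return value > SLO_RESPONSE_MS

response_times = [300.0, 340.0, 390.0, 450.0]                # rising workload
print(should_reconfigure(response_times, proactive=False))   # False: no violation yet
print(should_reconfigure(response_times, proactive=True))    # True: violation anticipated
```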
Yeah. So if you find that interesting, if you find Melodic and the Melodic platform interesting, please stay in touch with us. Follow us on LinkedIn, Twitter and Facebook, or visit the melodic.cloud website. And now I want to briefly talk about the use cases: who is using Melodic and what are the benefits of using it? After these use cases, Alicja will present a live demo of how to use Melodic and how it looks. And I really encourage you to give it a try. So the first user of Melodic is AI Investments. It is a Polish fintech company which is working on advanced methods for portfolio optimization using AI models. AI Investments works on time series forecasting and also on portfolio optimization. The AI Investments platform invests in over 200 different markets, and for each market a separate forecasting model is trained, and it needs to be retrained periodically. So the typical business goal for AI Investments is to train a predefined number of models in a specific time, of course using as few resources as possible, just to save cost. The investment analyst at AI Investments starts to train the models using on-premises resources, because they are already present, so they can be used. But if Melodic determines that with the on-premises resources it takes too long, for example 3 hours, then additional cloud resources are deployed and the number of workers is increased. After that, new metric values are collected and Melodic calculates the time to finish. If it's still too long, then additional resources are added; and if the number of resources is enough to process in the given time, the processing is finished and all of the cloud resources are removed, just to save cost. So that's a very typical process for using Melodic. I think it's quite simple, but the results of that optimization are very significant. So the benefits, from the AI Investments point of view: it is a very effective way of optimizing resources, and Melodic is a very good fit for scaling machine learning training. We can also control the budget, because we can optimize the time under budget constraints and not exceed the usage of cloud resources. And it also increases the reliability and availability of the application, because in case of a failure of one component, Melodic deploys an additional one. So for AI Investments, the savings are quite significant: we were able to save 175,000 US dollars over a three-year perspective. It is the difference between the optimal and the non-optimal solutions. Probably the real savings are lower, but they are still very significant. We have published these results, so they are available with all of the conditions and how the measurements were done. You can find that on the Melodic webpage, with a more detailed use case description. The second application is a big data application deployed on Spark, and it is used to process genome data to find mutations in a given genome compared to a reference one. This application is used by one of the Polish universities to perform research related to the genome, especially to mutations in the genome. It is very valuable knowledge for identifying genome-based diseases and other issues, and comparing these mutations and the similarity between them gives the researchers from the university very valuable knowledge. They are also using Melodic to optimize the processing time. The use case is very similar to the one for AI Investments. So a researcher at the university starts processing a given workload, usually given genome data, and wants to compare it with the reference data. After starting the process, Melodic tries to determine how long it will take. If it takes too long, then additional resources are added, and again the metrics are collected and Melodic determines how long it will take. It's still too long: the researcher wants to finish this task within 1 hour, but even with the additional resources it's still too long. So Melodic automatically adds new resources, and the new time to finish is below 1 hour. The costs are also optimized to have a balance between processing time and cost. And of course, after the processing is finished, Melodic removes all of the cloud resources. That's the most typical use case for the application. Of course we have more users of Melodic; all of these stories are described on the Melodic website, so I really encourage you to take a look and go more deeply into that.
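The scale-out loop described for both use cases boils down to estimating the time to finish from the measured throughput and adding workers until the deadline can be met. The sketch below illustrates that calculation; the numbers, the cap of ten workers and the function name are invented for illustration and are not taken from Melodic.

```python
import math

def workers_needed(remaining_tasks, tasks_per_worker_per_min, minutes_left,
                   max_workers=10):
    """Smallest worker count that finishes the remaining work before the deadline."""
    required_throughput = remaining_tasks / minutes_left            # tasks per minute
    needed = math.ceil(required_throughput / tasks_per_worker_per_min)
    return min(max(needed, 1), max_workers)

# 9000 remaining tasks, 30 tasks/min per worker, 60 minutes left -> 5 workers.
print(workers_needed(9000, 30, 60))
```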
And now Alicja will show a live demo of how Melodic works. We will go through the whole workflow. So we will start with the CAMEL model of the application. Then, we will maybe not deploy the Melodic platform itself, but briefly show it. Then Alicja will show how to submit the CAMEL model to Melodic and how to start the application deployment. Then everything is done fully automatically, and at the end you can connect to your application and enjoy. So now it's time for the live presentation. Now I would like to present how to automatically deploy your own application with the Melodic platform. I will perform the deployment of a Spark-based application. We will monitor the application metrics and observe the reconfiguration process, which is done by Melodic for optimization purposes. My Melodic platform is installed on a virtual machine on AWS and it is up and running; I'm logged in. Melodic users are managed by LDAP. We have three possible roles of users: the common user, who can perform application deployments; the admin user, who manages user accounts and also has all the privileges of a common user; and the technical user, which is used only internally by Melodic components and is not important from the client's point of view. The first step in Melodic usage is defining the cloud settings. In the provider settings part of the menu, we can check and update the providers' credentials and options. As we can see in the cloud definition for providers view, filling in these values is required in order to perform a successful deployment, because they are used when contacting the providers, for example when creating virtual instances. On my environment, I have already defined these values for the Amazon Web Services and OpenStack providers. In these definitions we provided cloud credentials and properties, for example settings for the Amazon security group or a set of private images which we would like to use in our deployments. When our platform is properly configured, we can go to the deployment bookmark. Today I would like to deploy the Genome application, which was described by Paweł a moment ago. Before deployment, we need to model our application with its requirements in a CAMEL model, which is a human-understandable and editable form. After that, such a model is transformed into the XMI format, a form understandable for Melodic. We upload this file here by drag and drop. Now our model is being validated, and after that it will be saved in the database. In a minute I will be asked to fill in the values of the AWS developer credentials. Providing these credentials is required in order to save the results of our Genome application in an AWS S3 bucket. But for security reasons, we shouldn't put them directly in the CAMEL model file. So we use placeholders in the CAMEL file, and after that we need to provide these values here. In this case, it is not the first upload of such a model on this virtual machine, so these variables already exist in the dedicated secure store. I can verify them, update them if they were changed, and after that choose the save button.
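The placeholder mechanism just described can be pictured roughly as below. The {{NAME}} syntax, the model fragment and the store contents are hypothetical; the point is only that secrets are filled in from a secure store at upload time rather than written into the model file.

```python
import re

# Hypothetical model fragment with credential placeholders instead of secrets.
camel_fragment = "s3.accessKey = {{AWS_ACCESS_KEY}}\ns3.secretKey = {{AWS_SECRET_KEY}}"

# Values supplied through the UI and kept in the secure store (redacted here).
secure_store = {"AWS_ACCESS_KEY": "AKIA_EXAMPLE", "AWS_SECRET_KEY": "********"}

def fill_placeholders(text, store):
    """Replace {{NAME}} markers with the values provided by the user."""
    return re.sub(r"\{\{(\w+)\}\}", lambda match: store[match.group(1)], text)

print(fill_placeholders(camel_fragment, secure_store))
```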
In the last step, I need to choose which application I want to deploy and which cloud providers I want to use. Here it is also possible to run the application in simulation mode. Simulation mode is the case when we don't want to deploy real virtual machines on the provider platform, but only check which solution will be chosen by Melodic. We manually set the values of the metrics in the simulation part and observe the result, but today our aim is to perform a real deployment of the Genome application, so I leave this option turned off. We would like to deploy Genome only on AWS, so we choose this cloud definition; thanks to that, Melodic has the credentials for this provider. After that, we can go to the last step here, where starting the deployment is available. After starting the process, in a minute we are moved to the deployment process view. Here we can observe its progress. In the meantime, I would like to briefly describe the application which is being deployed by Melodic. Genome is a big data application which performs some calculations and saves the results in an AWS S3 bucket, so we need to provide developer credentials for AWS. Genome's performance is managed by Spark. In the Genome application we use Spark as a platform for big data operations, which are performed in parallel on many machines and managed by one machine named the Spark master. The Spark master is available by default on the Melodic platform. Melodic creates the proper number of Spark workers as virtual machines, considering our requirements from the CAMEL model. Thanks to measurements of the application metrics, Melodic makes a decision about creating additional instances with workers or about deleting unnecessary ones. Spark divides all the calculations, named tasks, between the available workers in order to optimize application performance and cost. Please let me come back to our process. Fetching offers is the first step of the deployment process. We have information about the current total number of offers from the previously selected providers, so in this case from AWS. From these offers, Melodic will choose the best solution for the worker component. After choosing this box, or the offers option from the menu which is available here, we are directed to the view of all currently available offers. There are clouds with my credentials and also with my properties for the security group and for filters for our private images. We also have here hardware, with information about cores, RAM and disk; available locations, where our virtual machines could be located; and the last element here, images. Only private images are visible here, but of course all public images are available to us. Now I come back to our process view, and we can see that the next step of the process is generating the constraint problem. The constraint problem is generated based on our requirements defined in the CAMEL model. In the simple process view, all variables from the constraint problem are visualized with their domain values: the Genome worker cardinality, the worker cores and the provider for the Spark worker. Detailed data are shown after clicking this box, and presented here are: the list of variables with additional information about the component type, the domain and the type of this domain; the utility formula, which is used to measure the utility of each possible solution and choose the best one; and the list of constants with types and values. They are created from the user requirements and are used in Melodic's calculations. Here we can see, for example, the minimum and maximum values for the cardinality of Spark worker instances, or the same type of restriction for the number of Spark worker cores. So we can see that in our deployment we would like to have from one to a maximum of ten workers. And the last element here is the list of metrics with data types and initial values. They describe the current performance of the application. Thanks to them, Melodic can make a decision about triggering the reconfiguration process, which means creating new additional instances or deleting ones that are not fully used. Thanks to metrics, Melodic can do the most important task, which is cost optimization. We go back to the process view. When the constraint problem is generated, it is time for reasoning. Melodic finds here the best, most profitable solution for the problem defined by us.
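To make the reasoning step concrete, here is a toy stand-in: enumerate the small variable domains (worker cardinality 1 to 10, cores per worker) and pick the cheapest assignment that still meets the deadline. Melodic uses real constraint-programming and machine-learning-based solvers; the prices, throughput and deadline below are invented for illustration.

```python
import itertools
import math

VM_PRICE_PER_HOUR = {2: 0.10, 4: 0.19, 8: 0.37}   # hypothetical offers by core count
REMAINING_TASKS = 9000
TASKS_PER_CORE_PER_MIN = 10
DEADLINE_MIN = 60

def plan_cost(workers, cores):
    """Cost of a feasible plan, or None if it would miss the deadline."""
    minutes_to_finish = REMAINING_TASKS / (workers * cores * TASKS_PER_CORE_PER_MIN)
    if minutes_to_finish > DEADLINE_MIN:            # constraint from the model
        return None
    billed_hours = math.ceil(minutes_to_finish / 60)
    return workers * VM_PRICE_PER_HOUR[cores] * billed_hours

feasible = {(w, c): plan_cost(w, c)
            for w, c in itertools.product(range(1, 11), VM_PRICE_PER_HOUR)
            if plan_cost(w, c) is not None}
best = min(feasible, key=feasible.get)
print("workers:", best[0], "cores per worker:", best[1], "cost:", feasible[best])
```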
When reasoning is completed, we can observe information about the calculated solution: the utility value and the values for each variable. In this case, one as the worker cardinality, the value for the worker cores, and the provider for the Spark worker with index zero, so it is AWS. The next step in the deployment process is deploying. Here Melodic performs operations based on the calculated solution. This solution is deployed for each application component: Melodic creates the proper instances, maintains them or deletes them. If you want to have a more detailed view, it is possible to see the process view using Camunda by choosing the advanced view button in the upper left corner. Camunda is a tool for monitoring and modelling processes in the BPMN standard, and for managing them. I log in with the same credentials as for my Melodic platform, and in order to see the detailed view in Camunda, I need to choose running process instances and, after that, the process to monitor from the list. And now we can see the view of the chosen process with all its variables, and also a detailed view of the whole process with each step. This view is for more technical users; it could be useful, for example, during the diagnosis of some problems. We can see that we are already here, so it is the end of our process. In order to verify this fact, I go to the Melodic UI again, and yes, we can see that our application has successfully started. So the deployment process is finished and I can check the result in the 'Your application' bookmark. In this view the list of created virtual machines and functions is displayed. The Genome application requires only virtual machines. We can see that Melodic created one virtual machine. So far, this machine is created at the AWS EC2 provider, in Dublin. What is more, we have here a button for a web SSH connection, which is really useful in the testing process. Having successfully deployed the Spark application with Melodic, I can go to Grafana. Grafana is a tool for monitoring and displaying statistics and metrics. We can use it for monitoring the performance of applications deployed by Melodic. Each application has its own metrics and its own parameters to control, so we need to create a dedicated Grafana dashboard for each of them. The Genome application also has its own Grafana settings, and we can see them here. For now, metrics from our application are not available yet; we can see only that we now have one instance, so one worker. In the meantime, we can control our application in the Spark master UI. The Spark master is built into the Melodic platform, so we go to the same IP address and port 8181 in order to check the Spark master UI, and here we can observe the list of available workers. After refreshing this view, of course, we can see that we now have one worker, one running application and also one driver. So now all tasks are sent to this one worker by our Spark master, and this is the situation after the initial deployment. Decisions about creating new workers or deleting some of them are made by Melodic based on the measured metrics. In such a situation a new process is triggered, and it is named the reconfiguration process. I think that now we can go again to our Grafana dashboard, and we can see that the metrics are being correctly calculated and passed to Melodic, because they are also visible in our Grafana view. The colour of the traffic-light indicator informs us whether the application will finish on time. Now we can see the first estimation, so it may not be correct, I think, because we don't have enough data for a good estimation, so we need to wait a minute. And even now we can see that our light is red.
We can also see that the time left is not enough to finish our calculations on time. The expected time is indicated in the CAMEL model, and in this case it is equal to 60 minutes. We can observe how many minutes are left from this time period, and under this time-left value, based on the current performance, the estimated time left is calculated. On the left, on the first chart, we can monitor the number of instances; so now we have one node, so one worker. In the bottom one, the number of remaining simulations is presented; this value decreases as Spark performs the next tasks. On the right, on the chart named number of cores, we can see the value of the minimum cores needed to finish the calculations on time and the current number of cores. Under the total cores value, the green one is the required number of cores and the yellow one is the current number of them. Now Melodic claims that we need at least four cores, and even now six cores, seven cores, and we have only one. Also, the estimated time is higher than the time left; we can see the red light. So these are signals that our application needs more resources. In such a situation, Melodic makes a decision about triggering the reconfiguration process. So we can suppose that in the background a reconfiguration process should be being done. In order to verify it, I go back to our Melodic UI and to the process view, and here we can see the current process, and it is our reconfiguration process. In the reconfiguration process, Melodic doesn't fetch new offers and uses the same constraint problem as for the initial deployment. For these reasons, the first step is reasoning. As a result, we can see the new calculated solution which will be deployed. Now Melodic claims that two workers will be needed, and this solution is now being deployed, so in a minute we will see our new worker. Oh yes, even now our new worker should be visible, because the reconfiguration process is finished. I can verify this fact also in the 'Your application' part. And yes, now we have two virtual machines, two workers, and this is the new one from our reconfiguration process. I can also check this fact in our Spark master UI. I need to refresh this view. And now two workers are available; we have two live workers. So now we can see that the Spark master divides tasks between these two workers. In the case of Genome, in the first part of performing the calculations, additional workers are created as long as Melodic measures that the effectiveness of the application is too low. In the final part of the Spark jobs, Melodic makes a decision about deleting unnecessary instances when it is visible that the application will finish on time. And now we are in the initial part of the whole process because, as I mentioned, we have 60 minutes to perform the whole process. So we are at the beginning of it, and now additional workers are being created because of the effectiveness of our application. Now I go to the Grafana view and we can see that we now have two workers, two nodes. The next tasks are done, and we can see that our estimated time is now close to the time left. And even now, Melodic claims that it will be possible to finish the whole process on time. But now our estimated time is bigger again, so we can suppose that in a minute our light will be red again, and probably we will see a new reconfiguration process. And the whole process is performed up to the moment where our estimated time is enough for us, enough for our requirements, and thanks to that, finishing the whole process in our expected time will be possible, right?
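The traffic-light logic on the dashboard can be summarized with a small sketch: compare the estimated time to finish with the time left from the 60-minute budget, and derive the minimum number of cores. The thresholds, rates and names below are illustrative, not Melodic's actual formulas.

```python
import math

def dashboard_status(remaining_simulations, simulations_per_core_per_min,
                     current_cores, minutes_left):
    """Return a traffic-light colour and the minimum cores needed to finish on time."""
    estimated_min = remaining_simulations / (current_cores * simulations_per_core_per_min)
    required_cores = math.ceil(
        remaining_simulations / (minutes_left * simulations_per_core_per_min))
    if estimated_min > minutes_left:
        light = "red"       # will not finish on time -> reconfiguration expected
    elif estimated_min > 0.9 * minutes_left:
        light = "yellow"    # barely on time
    else:
        light = "green"
    return light, required_cores

# One worker core, 5000 simulations left, 55 of the 60 minutes remaining.
print(dashboard_status(5000, 10, current_cores=1, minutes_left=55))   # ('red', 10)
```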
So we have successfully observed the reconfiguration process of a Spark application. This is the end of the demonstration of a Spark application deployment done by Melodic, and we can see that the whole optimization process is done fully automatically. Okay, so thank you very much for your attention, and this is all from my side. Thank you. Thank you, Alicja. I hope you enjoyed this presentation. If you have any questions, please do not hesitate to ask them. And also, as I said, I really encourage you to download Melodic and at least give it a try, especially as Melodic is fully open source, released under the Mozilla Public License 2.0, so you can use it for any type of workload, commercial and non-commercial. And please follow us on LinkedIn, Twitter and Facebook, and please visit the Melodic site, www.melodic.cloud. If you have any questions, please do not hesitate to contact us. I also really encourage you to follow our Morphemic project, which is the successor of Melodic and aims at the further development of the Melodic platform with really novel and unique concepts like polymorphic adaptation and proactive adaptation. So stay tuned, take a look, and I think it could be very interesting to see what the results will be. And once again, thank you for the invitation. We are very happy to be able to present this session to you, and I hope that you will find it useful. Thank you.
...

Paweł Skrzypek

CTO @ 7bulls.com

Paweł Skrzypek's LinkedIn account

Alicja Reniewicz

Full Stack Engineer @ 7bulls.com

Alicja Reniewicz's LinkedIn account


