Conf42 Cloud Native 2024 - Online

Journey Beyond: AWS Quest for Excellence

Abstract

Will you take the challenge to revolutionize your company's first cloud experience? I invite you to look through the prism of my cloud journey to uncover strategic decision-making, CloudOps mastery, and real-life experience with AWS, so you can empower your business needs and get the best from the cloud.

Summary

  • Dmytro Hlotenko: Welcome to Conf42 Cloud Native 2024. In this session I will give you some specific details on AWS services and tell you a story of how I modernized an already existing service. Please get comfortable, because it will be a session with specific details and a real-life story.
  • Dmytro Hlotenko is a cloud engineer at APA-IT, or, as my colleagues call me, Mr. Amazon. If you have any questions after this session, you will have my LinkedIn QR code at the end. You are welcome to send me an invitation.
  • The topic of our discussion today is MediaContact Plus. It's one of the applications that we are running on the backbone of AWS. With these services, AWS gives you many opportunities that you can utilize to deliver your business targets. What I absolutely love about SES is that it does not create any headache.
  • CloudWatch is an amazing thing. If you just take some time to analyze the logging, analyze existing metrics, and maybe onboard some additional metrics, you will have lots of valuable data, insights, and dashboards. You can do a lot on AWS, but as you know, every press of a button on AWS costs money.
  • This was the first cloud deployment in the company. The goal is to remove as many blind spots as we can and improve efficiency. Right-sizing also belongs to it, and setting up reliability should be taken into account for every production deployment. Security, as I mentioned, is also on this list, and the target for me was to get rid of operational overhead.
  • AWS Fault Injection Service allows you to break parts of the setup granularly. Performance Insights is even more valuable than the standard RDS performance monitoring. Most importantly, it's cheap: out of the box you have seven days of retention for free.
  • The most important thing about running on AWS is to understand whether you are taking the proper service for the thing that you need. I highly recommend ECS on Fargate. It was important for us to avoid vendor lock-in.
  • Fargate is a cool service. Usually you don't have the operational overhead; it just runs your stuff. The balance is extremely important, and you should understand how your effort and the price for the service weigh against each other. For me it was crucial to use instances with better network performance.
  • Graviton instances give good performance for good value, but you should check whether the effort to make your application run on Graviton could be covered by something else on AWS. Spot instances are sometimes 60, 70, 80% cheaper than on-demand.
  • The automation builds a new EC2 image with a fresh base image, updates the dependencies, then provisions it. The application runs automated testing, which is done by Lambdas too. When testing is successful, I get a full image that I'm ready to use. It takes business hours into account.
  • gp3 is an amazing volume type that can provide consistent performance without limits, and it's even cheaper. Check whether your RDS volume is encrypted. Latencies are also important, because the response time directly affects your application performance.
  • Dima: If you want an RDS instance, take a Graviton instance. You can save up to $50 to $70 by using t4g instead of m6g. What Aurora resolves: it eliminates the need for application re-engineering.
  • In the whole process it's important to communicate with your team. You must know the details, because we can talk about "the database" on any cloud and it only looks the same from the surface. With AWS you can have everything covered and done, and this is amazing. I'm really excited to work with such services.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and servus from Vienna. Welcome to Conf42 Cloud Native 2024. I'm Dmytro Hlotenko and I am pleased to welcome you to my session, Journey Beyond: AWS Quest for Excellence. In this session I will be happy to give you some specific details on AWS services, tell you a story of how I modernized an already existing service, how I resolved different issues, and how I made the overall experience of running our application on AWS much better for our solutions team, for our developers and for our product team. Please get comfortable, because it will be a session with some specific details and a real-life story. I'm really excited to have you here, and since we are limited on time, let's jump into the session. But first, just a few short words about me. I'm Dmytro Hlotenko, I am a cloud engineer at APA-IT, and my colleagues call me Mr. Amazon. I have been an AWS Community Builder since 2024 and I'm a co-leader of an AWS user group. I hold a few degrees related to telecommunications, and I have been in IT for a long time; I have seen different things in different places, but I really enjoy the cloud and everything related to architecture and software, and I'm happy to deliver not only conference talks but also services to our customers and to interact with different exciting people. I'm also a motorsport fan and a photography hobbyist, and in terms of AWS I'm still targeting the Golden Jacket, but that is a long way to go. If you have any questions after this session, you will have my LinkedIn QR code at the end; you are welcome to send me an invitation, and I'm always happy to discuss any points and to help you. I would also like to announce what we will have today. I'm going to give you some information about what we are actually doing with AWS, because this case was one of the first ones I refactored after I started at APA-IT. I will give you brief information on the application and, in general, some details on how the services cooperate, how it was built, how it was reworked, and some small details that will hopefully give you inspiration for your workload or an idea for a potential case you need to resolve. First, a short introduction to my company. At APA-IT we support most of the biggest Austrian media companies, such as ORF, which is the national broadcaster, and lots of publications like Der Standard. We have not only media customers: we also develop special applications for journalists, applications for communication and data processing for our press agency together with various other smaller European agencies, and we process a huge amount of information in order to deliver different media to the public, to deliver true news and other exciting things. We have two data centers in Vienna, and I'm the go-to person for cloud topics and AWS, so I have already put my hands on different cloud integrations in different applications that we run. We run lots of mobile applications on AWS, we do analytics and data processing, and of course, since we have critical appliances, we do disaster recovery. But this talk is not about that; it is about media publishing and processing, because AWS, with its services, gives you so many opportunities that you can utilize to deliver your business targets and build completely different business logic for different cases.
The topic of our discussion today is MediaContact Plus. It's one of the applications that we are running. The idea behind it was to create a centralized solution that will deliver your news: for example, you have an exciting announcement that your company launched a satellite or a new invention, whatever, and you can come to MediaContact Plus, select the best journalists who will take the best from your information and share it with the right audience. You just tell it what and where, and the application is built in such a manner that your information will be processed and spread. Since it's powered by AWS, it was actually the first project that APA-IT delivered in the cloud. Amazon SES is an amazing solution: we have a few projects that rely on it, and the complete service is built on the backbone of AWS SES. It is a fundamental piece and a good example of how Simple Email Service can power a not-so-simple application. We also have integrations with other APA systems just to make sure that messages are delivered properly and that their resonance is measured. So we process lots of data that comes from AWS SES and from CloudFront, and all of this data is processed by automation that I'm going to present a bit later in this session. So there is a unique integration of an AWS solution in a real product, but we also have to integrate it cleanly with the real people who work in the product team at APA-IT. This has been running since 2021, and the reason this session exists is that I tried to maximize the outcome of our AWS usage, because it was not built by me initially; I took the project over and made it better. I came, I rebuilt, and here we are. Just a few words on SES and what I absolutely love about it: it does not create any headache. We have lots of customers, and I don't want to lie, but lots of big companies are already sending through it, and the setup is pretty easy. You just have to follow a few small things like the DMARC setup, you have to get the DKIM alignment and SPF records right, you have to make sure the domain is validated properly, and you have to follow some small guidelines from AWS. For example, if it's marketing mail, you should include an unsubscribe link in the body of the message. AWS checks this, and you should take care of it, because we have to fight spam together. The setup is extremely easy and it does not limit the manner in which you use it: you can use it like a normal SMTP client, or as an API integration for your application, and of course it can be part of a completely serverless project. I use SES for my own purposes as well, not only the commercial ones. And what do I hate about it? Because it's love and hate, which is normal with interesting things. The complaint rate is a bit specific, because you might get complaints simply because the IP addresses SES sends from show up on some strange lists that are not even directly accessible via a link, and that may hurt your delivery rate. This matters because this is a messaging product and you want your messages delivered. And out of the box you don't have any logging or tracking; you have to build it yourself. But I will show you an example that may also be useful for you.
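As a preview of that kind of tracking: the usual pattern is to have SES publish bounce and complaint notifications to an SNS topic and handle them with a small Lambda. A minimal sketch, where the topic wiring and the suppression table name are assumptions rather than our exact setup:

```python
import json

import boto3

# Sketch of a Lambda that receives SES bounce/complaint notifications via SNS.
# Assumes SES notifications are already published to an SNS topic that triggers
# this function; "suppressed-recipients" is a hypothetical DynamoDB table.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("suppressed-recipients")


def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])  # SES notification body
        notification_type = message.get("notificationType")

        if notification_type == "Bounce":
            for recipient in message["bounce"]["bouncedRecipients"]:
                table.put_item(Item={
                    "email": recipient["emailAddress"],
                    "reason": "bounce",
                    "subType": message["bounce"].get("bounceSubType", "unknown"),
                })
        elif notification_type == "Complaint":
            for recipient in message["complaint"]["complainedRecipients"]:
                table.put_item(Item={
                    "email": recipient["emailAddress"],
                    "reason": "complaint",
                })
    return {"processed": len(event["Records"])}
```

From the suppression table, the application (or another Lambda) can then decide not to send to broken recipients anymore, which is exactly the kind of tracking SES does not give you for free.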
As the name says, it is a simple service, and it just gives you the sending functionality. I'm pretty excited about it, because it's a good foundation, and I don't think there are many better services that can do the same without comparable effort. You just need to take a few things into account: proper domain configuration on AWS is essential, and that's why we take some time to track the delivery rate and keep some additional notes because of it. But in general SES is an amazing service, and if you just have to send email, this is the way to go. Going back to the global vision of the application: you come to the company, so let's imagine the case. You have a perfectly working application, and you are a cloud architect, engineer, Ops position, doesn't matter. Since it was the first cloud experience for your company, it was probably built nicely, your colleagues are great people, and of course we could leave everything as it is: don't touch it if it works. That is like a mantra, but that's not us. We are interested in gaining the benefits of the cloud for our company and for ourselves, and in becoming better developed, cooler specialists. Of course we could also just drink coffee and relax: if it works, why should we stress? But that is not the way, and this is why we start the analysis. This is our work, and we shouldn't blame the cloud, because the cloud actually is amazing. What I would like to underline: the best way to understand what you have is to see how it runs, what you are running, who is running it, and simply to talk to your colleagues, because they have already done it; of course they analyzed the situation and made the decisions. First of all, you have lots of different information, and data is extremely valuable here. Thankfully AWS gives you lots of places where you can get this data from different perspectives, and from AWS services alone you will already have half of the information you need. So let's have a quick look. You have to talk to the people, this is extremely important: you cannot just deliver the thing on your own and make a statement that this is how it is. Of course, if you have some authority you can do it, but it will still be easier the other way. It's already running, and you must see why and how it runs. CloudWatch is an amazing thing: if you just take some time to analyze the logging, analyze the existing metrics, maybe onboard some additional metrics, you will have lots of valuable data, because then you can understand the behavior of the customers who visit the application, and you can see how well you scale or how well you actually utilize the resources you have created. Maybe you have an underutilized environment and you can simply reduce the size of your instances or, for example, the RDS, whatever you have, and be much better off. Logs Insights and dashboards are an amazing way to gather all the data together, and with Logs Insights you can run queries, for example for some specific message that comes out of your application. Of course your application must be connected to CloudWatch, for example with the Docker log driver or whatever else you use.
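To make the Logs Insights part concrete, here is a minimal sketch of running such a query from Python; the log group name and the query itself are only illustrative:

```python
import time

import boto3

logs = boto3.client("logs")

# Hypothetical log group; the query counts error lines per hour.
QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(1h)
"""


def run_insights_query(log_group: str, start: int, end: int):
    query = logs.start_query(
        logGroupName=log_group,
        startTime=start,  # epoch seconds
        endTime=end,
        queryString=QUERY,
    )
    # Poll until the query finishes, then return the result rows.
    while True:
        result = logs.get_query_results(queryId=query["queryId"])
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result["results"]
        time.sleep(1)


if __name__ == "__main__":
    now = int(time.time())
    rows = run_insights_query("/app/mediacontact", now - 24 * 3600, now)
    for row in rows:
        print({field["field"]: field["value"] for field in row})
```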
One of the things I would like to mention separately is AWS Fault Injection Service, because any application can appear to run perfectly, but you don't want to be surprised. Of course it's nice to have a surprise, for example a BMW parked in front of your window, but not waking up at 2:00 a.m. to a call from your operations team that something is broken, and you don't have recovery plans, you don't have anything. The Fault Injection Service is amazing because you can simulate breakage of AWS things, you can break your application, you can break the networking, and simply understand what is missing: where you lack observability, and where maybe some functions in your application are missing. It's an amazing tool to remove the blind spots. And yes, I have to admit that Apache JMeter is a good old tool, but in this case I also used Robot Framework and Selenium to simulate our users and to understand how much we can run, and in which way. And since part of this talk is about complying with existing processes: I really like CloudWatch because it does a lot of things, but Checkmk out of the box gives you the most essential things you would want to monitor in your AWS account. It does auto-discovery of the most important pieces and does a pretty amazing job, so you can have a look at how it works and use it as a foundation for your observability pipeline. And from the people side: in this case I worked closely with our developers and with our solutions architect, because he maintained the setup before, and it was important to find the reasons why, for example, it was not made serverless, or why this particular database was picked, and so on. It's also very important to know your customer: what they will do, in what amounts, and how. This is why you need to communicate with the product team, and the second reason is cost: you can do a lot on AWS, but as you know, every press of a button on AWS costs money. Your target as a cloud architect is not only to deliver an efficient solution but also to make it good from the cost side, because the business must be profitable, otherwise you will have no job, and you must consider lots of things in the cloud from the technical perspective. From the other perspective, you should see what you already have, and I have to admit that one of the best ways to understand what you have, especially if your team initially didn't use CloudFormation or Terraform or OpenTofu, doesn't matter, is to understand what you are working with, how it runs, where it runs, and how much data it consumes. One of the best ways to do that is infrastructure as code. If you already have the template, you are lucky: you can just go through it and see which resources are connected where. But if you don't have it and you come to a blank project with minimal documentation, just some diagrams of the setup, then rebuilding it as code is an amazing way to get an understanding of what you run.
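If there is no template yet, one admittedly cruder first pass, before you write any infrastructure as code, is to pull an inventory through the Resource Groups Tagging API; note that it only sees resources that carry tags, so treat this as a sketch, not a replacement for a proper template:

```python
from collections import Counter

import boto3

# Rough inventory of what exists in the account/region: list every tagged
# resource ARN and count them per service.
tagging = boto3.client("resourcegroupstaggingapi")

counts = Counter()
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate():
    for resource in page["ResourceTagMappingList"]:
        arn = resource["ResourceARN"]
        service = arn.split(":")[2]  # arn:aws:<service>:...
        counts[service] += 1
        print(arn)

print(dict(counts))
```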
Also, when you are working with AWS services, you don't have to reinvent the wheel, because most of the things are already created by AWS. For example, you don't need to run EC2 with a database, because you can go with RDS and have a managed database; you don't have to run something specific for shared storage, because you can use AWS EFS; and you don't have to set up extra monitoring, because you can try to utilize CloudWatch natively in your project, which is amazing, because you have lots of metrics you can process, et cetera. So, as with the database example, if AWS says you can use RDS for your thing, you should prefer to take RDS. And one hint for the certifications: you should know the AWS way of doing things, because it pops up in most of the certification questions, and if you are getting AWS certified, it is essential not only to use the services but to know what AWS will encourage you to do. Another good example is not hosting Kafka yourself: you can go to Amazon MQ or MSK and so on. And automate the routine, because time is gold and you get paid for your work; you don't want to go into some account and press some button by hand. If it can be automated, it must be automated. But don't automate everything: understand the value of things, because with every implementation you should understand that effort also costs money, and every service also costs money. You should find the balance point between how much effort you invest and what the output will be. Finding the blind spots is also a good point, because the application must be resilient: you must have almost no downtime, as little as you can, especially during business hours. From the security perspective, you must not have exposed credentials or unpatched things, and so on. And going forward, maybe you can save on some things by using spot instances, by using a different instance type, or a smaller RDS size if it is simply underused; you have to use the resources you are paying for, so maybe you can optimize it that way. Of course security is important, and thankfully AWS gives you lots of good practices for how and what can be done. You can use additional services and the baseline guidance for architecting and service implementation. You can make AWS as secure as you want, but you still have to be careful and include security at the stage of software development, building, imaging, and deploying to AWS. So, going forward, let's talk about the evolution: what we did and how I did it. But first of all, I would like to remind you that this was the first cloud deployment in the company, and we already have lots of amazing developers who work with lots of things, but they are used to working in certain ways. And what is important in every relationship, whether romantic or business: you don't have to scare anyone, and everything must be done softly. Our target as cloud engineers is to introduce cloud things into the company precisely but very carefully, because otherwise it may play against you, and you will end up dealing with some VMware host and calling it cloud. So the target here was not to cause any additional headache for the developers: give them what they already have, do as much as we can on the cloud side, completely transparently for them, and if there is an improvement that will minimize their work, just notify them. The goal is to remove as many blind spots as we can and improve efficiency, because efficiency is the most important thing for our application once we already know what we have, how it runs, what it needs, and who uses it.
It's nice to ask yourself such questions. Of course, as I mentioned, we don't want to break the processes. Can we run more cost-efficiently? That is the most essential question, because there are lots of opportunities to optimize your costs, from using FinOps tools, to simply having a good engineer, to using AWS Config, for example. There was a case where AWS Config was able to find lots of unused EBS snapshots that were just burning money, so you should use it too, and verify that your application is running on the proper instance with the proper database size; we don't want to underutilize or overutilize things. Right-sizing belongs here too, and setting things up reliably should be taken into account for every production deployment, because your service must be as steady as you can make it; this is why I will show you some things later that directly influence reliability. Monitoring coverage is also essential, because then you know how to behave in certain situations, or at least get notified, which minimizes downtime. Security, as I mentioned, is also on this list, and the target for me was to get rid of operational overhead, which is why I brought some automation to this project. So let's have a look. This is the initial setup. It might remind you of the most simple, I would say vanilla, deployment on AWS: we just have something running on EC2. We use CodeCommit and CodeBuild, we build the things, we have a few staging accounts, so everything is pretty simple, it just works, but it's nothing fancy. Of course we can already see some things we could improve: we could use private subnets, we could use a WAF, we could use two RDS instances, the second one as a standby to improve resiliency, because then you would have a shorter failover time. You can also think about backups, and maybe some automation, because at this stage there is no delivery of things coming from the main account. As I said, we were basically just running containers on EC2 instances, because it's a normal Linux virtual machine and you can have the normal Docker everybody is used to, but they are on-demand, and we have to ask: is that actually the right thing for us? A single RDS is one of the most dangerous things you can have, because then you will have, for example, 15 to 20 minutes of failover, and especially if you are on the credit system for your storage or your instance type, if you don't use a production-grade type, you can get stuck on credits, and if you don't monitor them you will be very unpleasantly surprised. No automation: no comment on that. Provision time was also essential because it affects the recovery time, and there was a way to improve it; that came out of analyzing the application behavior, and I went through the startup process. Monitoring was extremely basic: we were just checking the external endpoint of the load balancer, not even the application itself. And some other things: we didn't use any security or secrets management, a WAF was missing, and a few small things like that. And to redeploy, you actually had to go to the console and do click-ops. Of course, all of that can be resolved.
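One cheap guardrail against that credit surprise is a CloudWatch alarm on the instance credit balance; a sketch with an assumed instance identifier, threshold and notification topic:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when a burstable (t-class) RDS instance is about to run out of CPU
# credits. Instance identifier, threshold and SNS topic ARN are assumptions.
cloudwatch.put_metric_alarm(
    AlarmName="rds-cpu-credits-low",
    Namespace="AWS/RDS",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mediacontact-db"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:eu-central-1:123456789012:ops-alerts"],
)
```

The same pattern works for the storage burst balance on gp2 volumes, which comes up again later in this talk.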
But what I would like to underline again: the target was not to break the existing setup, which was working and which the developers were familiar with. I didn't want to introduce extra things or new services like ECS or whatever; the target for me was to keep the same baseline. Of course there are lots of things you can do, so please have a look at the current setup. As you can see, a lot of things appeared. We have a different layout for the running application, core subnets were moved, we have a different database, we got EFS for caching the shared data, and we got security checks from Inspector, from Config, and from GuardDuty. For this application we got additional integrations that just use the data coming from the application and from the Amazon services; as you can see on the right side, they can contact our product team or they can contact me, and the build process is automatic. Most importantly, the whole setup fits into the processes of the company as they had already been working for quite a long time. So what did we get after the transformation? I even improved the performance, because by reducing the costs and changing the approaches we could afford new things that were a better fit for our application; that's why you have to analyze the performance. And we greatly increased the monitoring coverage: we control the metrics from CloudFront, from RDS, from the application itself, the things that can show up in the application logs, the SES failures, the connections. There is a lot of valuable data you can just go and collect, like mushrooms after the rain. We now have a fully automated deployment that complies with the ITIL process and the whole change process; all you have to do is approve it, the automation gets the deployment done for you, and if something goes wrong, it rolls back. Security and resiliency were also improved, because we don't have exposed things anymore, things are better hidden, and we have a few additional security and conformance checks, which is important. And, as I mentioned already, SES events and some other small improvements. Going back to the diagram, I just want to mention that this is a multi-account architecture: we have a main account that runs production, a staging account, a test account for our developers, and a separate account that acts as the repository. There is a lot of interconnection between those accounts, but it lets you keep things more tidy, because you can keep lots of small things separate and you will have fewer points of breakage. It's also important to mention that it's not only about controlling the stuff in AWS: as you can see, we also control what comes in from the systems the application depends on. For the next part of this presentation, sadly I cannot ask you live, but please write in the comments or to me on LinkedIn: what is your favorite AWS service? Mine is actually a bunch of services that form the basement of the idea that AWS is basically Lego, and you can take any service and make it yours. The foundation for any expansion and interaction with AWS services is Lambda, EventBridge, SNS and CloudWatch Logs. That is the amazing four, I would say.
Lambda is so amazingly integrated with different things and just lets you do what you want, if you spend some time coding, or asking ChatGPT of course, and you can automate this stuff. With EventBridge you can have communication with SQS or SNS, and this is why SES does not cause much headache for us: we can react to bounce events, we can react to complaints, we can process the logging, and we can extract data out of it. SES is a simple service, but by adding Lambda and a few other things, and this goes for every AWS service, not only SES, you can create amazing things; you just have to understand what you want and what you need. And if you want to go even deeper, you can build Step Functions. I absolutely love Step Functions because they cover lots of amazing things for you. What is also important: you can use S3 for caching or DynamoDB for storage. AWS gives you lots of opportunities that you just have to come and use. So let's go forward. As I already mentioned about the AWS Fault Injection Service: it allows you to break parts of the setup granularly, so you can break the networking, and you can kill the database. I unfortunately forgot to mention that MediaContact Plus has three tightly coupled containers on one node, and those microservices take care of different parts of the application. For me it was essential to understand how it would behave if one of the microservices was taken out, and the Fault Injection Service saves a lot of time: you don't have to automate the failure yourself, and it's a game changer for observability coverage, because you don't have to wait for an event to happen. You can simulate the event, do the analysis, tell your developers the outcome, and understand for yourself how it would behave and what is missing. You can also use it to automate some routine checks. And yes, you can extend it: you can write Systems Manager runbooks that go to your host or your container and do various weird things. Most importantly, it rolls back all the changes it makes. It's even cooler now because it's part of the Resilience Hub, and the Resilience Hub not only lets you break things: if you are just starting or just not familiar, it gives you advice on how to proceed. So together with these things you can do the big piece of work that is basically called chaos engineering, and you can get ready with this service. Another thing I really like, and would really miss if I ran a database anywhere else, is that RDS has Performance Insights. It's an absolute lifesaver for performance troubleshooting, because you can just open Performance Insights and see whether you have problems related to your instance, or problems related to your application, for example if a SQL query was not built properly by your developer. It's amazingly simple to set up: with Postgres it's not an issue at all, but for MySQL you have to have a certain size of RDS instance, I guess bigger than medium. In our case we unfortunately had an RDS performance issue, and Performance Insights showed us the exact point: what was not running properly from the application side, and why the then-current AWS setup was not that good.
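Turning it on really is a small change; a minimal sketch with an assumed instance identifier (seven days of retention fall within the free tier):

```python
import boto3

rds = boto3.client("rds")

# Enable Performance Insights on an existing instance; the identifier is an
# assumption. 7 days of retention is the free tier.
rds.modify_db_instance(
    DBInstanceIdentifier="mediacontact-db",
    EnablePerformanceInsights=True,
    PerformanceInsightsRetentionPeriod=7,
    ApplyImmediately=True,
)
```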
And that is basically all it takes with Performance Insights: you press the button, you modify the RDS instance, you wait just a few minutes, and then you have an amazing source that can help you resolve all the potential issues. The coverage is simply amazing, and most importantly it's cheap: out of the box you have seven days of retention for free, and for example for three or four dollars you get three months of this service and a big backbone of data gathered from your real, running application with real users, so you can do long-term observability if everything comes in. From my perspective, Performance Insights is even more valuable than the standard RDS performance monitoring itself. It's so amazing that you simply have to have it activated. Going back to the Swiss Army knife: here is an example of the automation we have built around AWS SES. SES does some action, sends an email, whatever, and it writes a record to the logs. We have an EventBridge rule that, for example, once a week runs a Lambda that does a query over the CloudWatch logs, creates a report, and sends it to our team. With this you eliminate the requirement that your colleagues must be AWS-proficient and go into SES themselves; using AWS services you can build something that is familiar to them, and they are happy that they receive data they understand and don't have to manage anything. Going a bit further, since you can grab any data from SES, I also built different dashboards, and my product owner just comes to his account, not to his, to our account, sorry, the one that runs the application, opens the shared dashboard and sees the sendings per customer, where they went, and whether we had bounces or responses. There are also a few more Lambdas behind this: we can mark some recipients as broken, whatever, and we can interact with the database to give our application a better response. For example, if it's hard to re-engineer the application and you need some specific function, and you have it in AWS, this is an amazing example of how you, the cloud engineer, can extend specific backend functionality without making any changes to the application, and you will have the thing resolved. This is simply amazing, and this is why you have to use Lambdas together with the rest of the services on AWS. So, coming back to MediaContact Plus itself, I would like to talk about running the stuff, and the most important thing about running on AWS is to understand whether you are taking the proper service for the thing that you need. In our case we were running the application, it was already dockerized, it's microservices, et cetera, and we had a few options. One is just running EC2, which for most people could be scary because of the management, but it's not that bad actually. You can use ECS, and I really like ECS: it just gets the thing done. But unfortunately that was not the case here, because it is of course a new service, it's new for the team, and it runs differently from, say, Kubernetes or from the plain Docker everybody is used to. Still, I highly recommend ECS on Fargate: it must be one of your first considerations if you want to run an application on AWS. Then of course you have EKS, but you have to ask yourself whether you really want EKS, because it's real-life Kubernetes, and from my perspective it's built a bit off to the side by AWS.
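We stayed on plain EC2, but to make the "consider ECS on Fargate first" advice concrete, here is a rough sketch of what running a container there looks like through the API; the cluster name, image, role ARN and network IDs are placeholders, not our real setup:

```python
import boto3

ecs = boto3.client("ecs")

# Register a minimal Fargate task definition and run it once.
# Cluster, image, role ARN, subnets and security group are placeholders.
task_def = ecs.register_task_definition(
    family="mediacontact-web",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "web",
        "image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/mediacontact:latest",
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        "essential": True,
    }],
)

ecs.run_task(
    cluster="mediacontact",
    launchType="FARGATE",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroups": ["sg-0123456789abcdef0"],
        "assignPublicIp": "DISABLED",
    }},
)
```

In a real setup you would put this behind a service and a load balancer instead of run_task, but the point is how little host management is left once you go this route.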
I would say that ECS is AWS-native, while EKS is not integrated quite as well, and you have to understand that. ECS can be amazing for your small application or for some small business need, while EKS plus Karpenter are amazing things for big scale and so on. If you go even bigger, you can have ROSA, you can have OpenShift on AWS. But that is the real enterprise thing and I don't think you will need it, although I would prefer ROSA to EKS, because it runs very transparently, it works really well, and you don't have much vendor lock-in to one specific cloud provider like Azure, AWS, GCP, whatever. Which brings me to something else you should take into consideration: it was important for us to avoid vendor lock-in, and vendor lock-in is exactly the thing where you don't want to be tied to something. For us it was important that we can at any time take the application out of AWS, put it in our data center, for example, or relocate it to another cloud, just in case. The target was to be as flexible as we can and to keep control over things; that's why I did not change the underlying approach that we had already had for a few weeks, sorry, years, it's much longer. And when you take a service into consideration, you have to understand the balance. AWS does an amazing job and most of the services are cool; you just need to understand how to use them properly, which is why you need to work with the services and know the specific details. So, since we decided to stick with the EC2s: as I mentioned, ECS was a new technology I wanted to avoid introducing, and EKS is simply redundant here. Unfortunately this project was not that big, and it was somewhat limited on budget, and the EC2s just run the well-known Docker with three microservices on the host, and they fit amazingly well on one instance. I only changed the approach: initially we had one node that ran everything together, but we have different load on specific microservices and we couldn't scale them independently; if we scaled, we got the same node with the same things, so I had to divide them a bit. We still have EC2 at the base, and that is not that scary, and I will show you why. But first, let's see why budget was a heavy consideration when choosing the baseline for our application. As you can see, I take the bare on-demand price of the EC2s that run the application as the 100% baseline. Savings Plans are amazing: you are not stuck with a reservation, and if, for example, you have an organization and you run most of your applications on t3 or m6i, c5, whatever, depending on what you need (t3/t3a is the most well-known case), you can purchase some Savings Plans, and if they are underused you can share them or just let another account pick them up. What is amazing is that you already get about 40% of discount, and the only way to do better is to switch to spot. Some people are afraid of spot, but you shouldn't be, and I will show you why just a few slides later. Fargate is a cool service, I really like it, because usually you have no operational overhead, it just runs your stuff. But please get ready to cover the bill: if I took Fargate for our application, my product owner would kill me, because the pricing is about 473% of that baseline. And this is what I mentioned before.
The balance is extremely important, and you should understand how your effort and the price for the service weigh against each other. With Fargate you have almost no effort, it runs everything for you, but at the cost of extreme volume. If you instead take EKS with EC2 workers, you can of course use Savings Plans to compensate for the pricing of the EKS cluster, but you have the overhead, and you should ask yourself: do I really need a complete Kubernetes for this business purpose? Maybe it can be done with ECS, or even the way we did it. And just to mention, of course I have resources for you. The Kubernetes instance calculator is amazing, because you can tell it that you are using EKS and it will advise you which instances to take for your cluster and how your applications with certain limits fit onto it. Another amazing thing is the Fargate pricing calculator: you know the AWS calculator is a bit confusing, but the Fargate pricing calculator needs just a few clicks with the most essential information and you get the price, if you really don't want to manage anything and just want things running. In our case we decided to stay with the EC2s. Some of you might say we are crazy, but it is a working approach: we built a system that does our thing with almost no difference in pricing, so we have amazing performance, we don't pay as much, and we don't have any operational overhead, because we were simply well prepared for it. As I mentioned, it was important for us to keep control of the host, because we have some specific data-processing guidelines and we have to be compliant with certain things. We want to have control: we really do trust AWS, they have very high security standards, but we want to know what we have running and where it is running. Another important consideration for me: what would the benefit be if I moved everything to ECS, and how much time would it take? For every re-architecting action you should have a significant benefit to justify it. We also have tightly coupled microservices, but we still have to keep them alive together, and we need the load properly scaled; I wanted to avoid cases where, for example, we hit some Kubernetes limit and a pod gets kicked out, or the host is over-provisioned. You also have to understand the instance type that you use. In the case of MediaContact Plus, it was a very CPU-light application, but it was heavy on memory usage and on input/output and networking, so for me it was crucial to use instances with better network performance and better memory rather than CPUs that we would simply not use. People may say it's hard because you are working with a plain host, but thankfully AWS thought about us even here. And if it's just a small thing that can be done in a few hours, why should we spend the extra money, which could be spent on something else to improve other aspects of the project? And if we are talking about the workloads, one of the first things that comes to mind when instances are under consideration is that we could use Graviton instances, and they really are great: they are an amazing example of applying ARM technology, and they give you good performance for good value.
But the problem with Graviton is actually the one on this slide: yes, it is a containerized application, but you have to take into account that your dependencies must be able to run on ARM, and that you are not using specific libraries that rely, for example, on AVX instructions; the absence of AVX instructions can cause a huge performance drop for you. You also have to use some extra tooling like Docker buildx or QEMU, and if you already have many other applications with a heavy dependence on your Jenkins builder, you probably will not want to change things. This is why I think Graviton should be considered from the beginning; if you are already in the middle of the way, it's a bit hard to make the transition. I have also heard of cases not only with specific integrations but with certain monitoring tools, et cetera. So yes, you can save money by running on Graviton, but you should also work out whether the effort to get your application running on Graviton could be covered by something else on AWS, which is exactly our example: I just took another approach and saved money elsewhere. Coming back to Graviton: maybe you have to swap the base image, whether it's Amazon Corretto or something else, you just have to rebuild it, and you have to do long-term testing; as I mentioned, some things may break. But if you don't have other options, you should try the Graviton instances, and you can plan your architecture around them from the beginning. Another thing: for me it was a bit hard to motivate colleagues to change the pipeline. Despite the financial benefits, you still have to take care of building, testing, et cetera, and just taking a Graviton instance doesn't give you the benefits out of the box; it will not do the work for you. So what did we do? I simply re-delegated things. Spot Fleet is an amazing thing if you are running microservices and you want to be scalable, and if you are afraid of an instance being taken out of service, you should not be. First of all, with the spot fleet you can have a main on-demand instance that is covered by a Savings Plan and is reserved just to keep your things running, and we have different hosts in separate Auto Scaling groups attached to our Application Load Balancer. So you can always guarantee availability: one on-demand instance will always serve the connections for you, and the rest of the capacity during business hours can be gathered with spot instances, which are sometimes 60, 70, 80% cheaper than on-demand. Because of this you can even deliver a better experience to your customers: your application can run on faster hardware and you still get a smaller bill. For me it was important to slice the setup because of the inconsistent load on the microservices, so we have three groups serving the things, but you really must give spot a try if you run your stuff on AWS. As I mentioned, compared with on-demand you can save over 60%, and you can also have a Savings Plan on your on-demand part, so you already have big savings that you can use, for example, for security services or for increasing observability coverage. It's simply amazing that such a discount exists. And you must not be afraid of the termination factor. Why?
Because you get a notice from AWS, and you are backed from two sides: on one side your application is still running on the on-demand instances, and on the other side you get a message from AWS saying, hey, we are going to take this instance from you, so you can get ready, you can panic, or you can do nothing. In our case, on this slide, as you can see, the application interacts with Secrets Manager (that's not the point here, it's just about security) and with Aurora, because we got rid of RDS and switched to Aurora later on, during testing. The application knows where it runs: important tasks like database migrations or batch jobs in progress go to the main EC2 instance, which is on-demand and always there, while interruptible or short-lived tasks go to the spots. So we keep the application itself on the main instance, the rest of the non-critical load is handled by the spots, and the application simply knows that it is on spot. We have a pool, the nodes are interconnected, and the two-minute notice is mostly enough to finish your work or save it in some state that can be picked up later, so you can continue processing the data. That is how you can stop being afraid of using spot instances. Since we are running on EC2s, the node provision time is crucial here, and since you are using Auto Scaling groups and a few other small things, you are also covered from a few angles. First, you can use scheduled scaling: as I mentioned, it's important to know your user, so you can, for example, start provisioning extra nodes before the peak business hours and scale them down again when the load goes into decline, so you are already prepared. You can also use target tracking or other policies that scale your application based on CloudWatch metrics. And what is also amazing on the Auto Scaling group side is the warm pool: with a warm pool you can pre-bake instances that are kept out of service but are always ready to come in and help you. I don't know why it isn't more widespread. In our case, running the application on EC2s, the lifesaver for me was EC2 Image Builder, and caching the shared data on EFS also helps, because you don't have to keep the same things, for example the same picture, on every instance; why not just say, come here and look at the picture? That is what you can do with EFS. But be careful: EFS sometimes has very weird performance because it is a network-attached drive; if it works for you, use it and don't duplicate things. Coming back to EC2 Image Builder: initially an application node was taking about 500 seconds to boot up and become alive in the service, because of course it installs the updates, it fetches the images from ECR, and only then it comes up and runs. So the first idea was to always provide a fresh image for our system. Of course you could swap the AMIs by hand, but EC2 Image Builder just rebuilds it, and it interacts with our change management: it opens the change request for me, my team lead approves it, the callback is sent, and I get a notification that the image was rebuilt, and then we provision the new nodes from the fresh image. So we already save some time because we have a good image.
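Going back to the spot side for a moment: the two-minute warning mentioned above can be read straight from the instance metadata service. A minimal sketch of the polling side, where the checkpoint function is a stand-in for whatever "save your state" means in your application:

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_token() -> str:
    # IMDSv2: fetch a session token first.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def interruption_notice(token: str):
    # Returns the notice JSON string, or None while no interruption is scheduled.
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except (urllib.error.HTTPError, urllib.error.URLError):
        return None  # 404 until AWS actually schedules the interruption


def checkpoint_and_drain():
    """Stand-in for saving in-flight work so another node can pick it up."""
    print("spot interruption scheduled, checkpointing and draining...")


if __name__ == "__main__":
    while True:
        notice = interruption_notice(imds_token())
        if notice:  # e.g. {"action": "terminate", "time": "..."}
            checkpoint_and_drain()
            break
        time.sleep(5)
```

The same warning is also published as an EventBridge event, so you can react centrally instead of (or in addition to) polling on the host.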
What's next? The idea was that we can also cache the Docker images on the host, not only rely on Docker itself, because when you start from scratch you don't have any caching out of the box. For example, if my developer changes something in one microservice but the rest are unchanged, we just refresh that one service, and that already saves half of the time we used to spend. I also went through the application startup: since it gathers caches from the database and other sources, EFS now stores the things that are on the hot path, and the response time of the application is now even better than before. You may have noticed that the ELB didn't like this at first, because such a long startup time was exceeding the default health check grace period out of the box. There is also an amazing service from Red Hat that some people don't know about: if you already have Red Hat running in your environment, on premises, and you have Red Hat subscriptions, you can bring your subscriptions to AWS and get access to the bring-your-own-license model. It's extremely easy to set up: you connect your accounts to Red Hat, you can use CloudFormation StackSets to set up all the necessary permissions that the Red Hat integration wants from you, you go to the Red Hat console, give it your accounts, and you have it; as a bonus you get the management tooling. What you have to understand is that some tweaking is needed: if you are a Red Hat administrator you know what to do, but out of the box the bring-your-own-license images are not that good for intensive auto scaling, so you have to adjust some configuration related to system activation and similar things. But if you have your licenses, please use Cloud Access: you will have your Red Hat stuff on AWS and you will not be billed hourly for RHEL, which matters, since some changes are coming if you want to use RHEL for your things. And regarding the building of the image, this is the diagram of what we have. As I said before, we have a few accounts, including ones for change management and resource management. So what happens? We have a few trigger sources: we can react to the release of a new version in the production branch; we can react to Amazon Inspector findings, for example, if there is a critical vulnerability, it will even skip the normal change process and patch it immediately; and we can react to successful testing or to Red Hat releasing a new image. EventBridge picks up the messages coming from these services, and then the automation, which is built from a state machine and Lambdas, comes into play. It builds a new EC2 image with the fresh base image and the fresh application, with all updates installed; it updates the dependencies, then provisions it and runs automated testing, which is done by Lambdas too, all tightly coupled with SNS and SQS. So basically, with these services I have built the whole pipeline that I would otherwise be doing by hand. When the testing is successful, I get a finished image that I'm ready to use, and I get a callback from the automation: hey, everything is good, we are ready to go. Then it creates the change request, and I'm not involved. Meanwhile nothing breaks; it hasn't broken in the last year, it just works.
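The trigger that lets a critical finding skip the normal change process is essentially just an EventBridge rule; a sketch with placeholder ARNs (the exact Inspector event pattern is worth double-checking against your own findings):

```python
import json

import boto3

events = boto3.client("events")

# React to critical Amazon Inspector findings by kicking off the image
# rebuild automation. The state machine and role ARNs are placeholders.
events.put_rule(
    Name="inspector-critical-finding",
    EventPattern=json.dumps({
        "source": ["aws.inspector2"],
        "detail-type": ["Inspector2 Finding"],
        "detail": {"severity": ["CRITICAL"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="inspector-critical-finding",
    Targets=[{
        "Id": "rebuild-image",
        "Arn": "arn:aws:states:eu-central-1:123456789012:stateMachine:rebuild-ami",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-stepfunctions",
    }],
)
```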
Then the approval comes: the change ticket is received, a message is sent to a specific endpoint in the separate account, and it tells the Auto Scaling group, hey, this is your new image, I've changed the launch template for you, all with Lambdas; boto3 is amazing. And then we have the new version live, and it takes the business hours into account. Basically, this gives us what we might otherwise get from other ways of running things, but since, as I mentioned, we have an amazing node provision time, we are very fast at getting the application up and running, and very fast at reacting to changes or releasing new versions. It's like a river: things come in, they flow through, and they end up in the outcome. What I would also like to say about the testing process: I test the application on the host, internally and externally, just to ensure that we will not cause any downtime with the update. Another thing I would like to talk to you about today: I think you like surprises, but look at what the RDS creation wizard gives you out of the box. You get a gp2 volume, and this is not what you want to have. This is the reason you may get a very big surprise, which looks like this: just look at this burst balance, look at those huge wait times on the database. This is crazy, and basically if you see this, it means your application is not working because of the burst balance. You might ask, okay, but what can we do about it? It's pretty easy, and thanks to AWS again: they introduced gp3, which is an amazing volume type that can give you consistent performance without any limits, and it's even cheaper; you know what IOPS you have, you know what you get. The important thing about gp2 is that it has a certain amount of IOPS and throughput tied to its size, and to achieve performance, gp2 volumes must be striped, and to get there you need something like a 100-gigabyte drive. What is essential about gp3 is that it works amazingly on small drives, like a system drive of 10 to 20 gigabytes, and it's cheaper, which is the most important thing, and it just runs. If you need a small drive for the database instance, then io1 and io2 are good, they give amazing performance, but they are expensive, so first check whether gp3 can cover your needs. To be fair to gp2: it might give you better performance if we are talking about terabytes, but that is not a reason to use it for your RDS, because you want your RDS alive. The reason gp2 is bad is that it causes I/O waits, and the volume simply becomes inaccessible when you run out of credits; if you don't monitor that, it will be a huge surprise for you as an administrator as to why the application is down. So these are the first things to check: whether your RDS volume is encrypted, because by default encryption is also not enabled, and whether gp2 is still in use; guys, please just use gp3 and you will be happy in most cases. Going forward, as I mentioned, whether we use Aurora or RDS, it's important to keep an eye on all the credits, for the CPU and for the storage, just to understand whether you are covering the needs of your application while it runs. Also important from my observability perspective: the metric called DBLoad is extremely valuable.
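Before getting into the DB load metrics, here is a quick way to audit the two gotchas just mentioned, gp2 volumes and unencrypted storage, across your RDS instances; the actual gp3 modification is left commented out, because a change like that should go through your change process:

```python
import boto3

rds = boto3.client("rds")

# Flag RDS instances that still run on gp2 or without storage encryption.
for db in rds.describe_db_instances()["DBInstances"]:
    name = db["DBInstanceIdentifier"]
    if db.get("StorageType") == "gp2":
        print(f"{name}: still on gp2, consider moving to gp3")
        # rds.modify_db_instance(DBInstanceIdentifier=name,
        #                        StorageType="gp3",
        #                        ApplyImmediately=False)
    if not db.get("StorageEncrypted", False):
        print(f"{name}: storage is NOT encrypted")
```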
About that DBLoad metric: if you see non-SQL load there, it means the load is caused not by your application; something is going wrong on the host itself, mostly because you are running out of storage, you are low on RAM, or the CPU is simply not enough to run your queries. Swap usage is actually a bomb that comes together with the gp2 problem: if you are low on memory and swap is active and you have gp2, you are burning the credits that could have been used by your application, so you get a double burn of your credits. It's better to scale up a bit and give it more memory rather than have swap used on the database. Of course it can be small, like 25 to 100 megabytes, but if we are talking about gigabytes, it's a path to catastrophe. Latencies are also important, because the response time directly affects your application performance; with the latency comes the general database performance. And the number of connections is essential if you are working with Lambdas, because every Lambda is a new connection. If you have a steady application that opens a connection and keeps the session, this is not an issue, but with Lambdas you have to keep an eye on it. So, going forward: we have already changed things. We changed the volume for the RDS, we got rid of the swapping, we have a bigger instance, but we still have the waits, and we don't know what to do. There are a few options. One of the things AWS promotes for improving performance is the use of read replicas, but that requires application re-engineering, and your application must be ready for it. For example, with Spring Boot, if you use the native connectors, you can mark data access as read-only in your schema, but you have to go to your developer and say, do it, and they will say, no, we have no budget. Then we could scale up the RDS, but we cannot keep going up to infinity; you can find a good instance size, but if it's not enough, what then? Some people might think of RDS Proxy here. RDS Proxy is a good thing for Lambdas because it does connection pooling, but it will not improve the read/write performance of your RDS instance. Or you can call in your AWS contacts: if you have a good technical account manager and a good account solutions architect, maybe you can ask them, although it's better to drink beers together. So what are the solutions? As I mentioned, read replicas mean re-engineering. RDS Proxy is amazing, but it will not help you here; in this case it helps only by reducing the failover time, by up to 79 percent. I really tested that: it cut it from five minutes to one minute. But you have to pay for RDS Proxy. If you go outside of AWS and look around, there are third-party proxies, but why would you pay serious money for some proxy if you could re-engineer your application for that money, and how can you be sure it really does the right thing? And then there is the amazing thing AWS has: they have Aurora, and they have DynamoDB. While DynamoDB just stores things for you, Aurora resolves the biggest issue of every classic database, which is that it is static and does not scale. In our case the application side was not the issue; it was bottlenecked by the database. And Aurora brings another very cool thing on top.
And there is an amazing thing that AWS has: they have Aurora and they have DynamoDB. While DynamoDB simply stores things for you, Aurora resolves the biggest issue of every classic database, namely that it is static and does not scale. In our case the application side was not the issue; we were bottlenecked by the database. And Aurora brings another very cool thing on top of that, but just before I switch to Aurora, I would like to answer a question you may have: OK, I do not want to use Aurora, I still want to stick to RDS, so which RDS instance should I pick? I have done a long-running evaluation of the performance data I gathered, and I would say: if you want an RDS instance, take a Graviton instance. RDS is the best use of Graviton you can have. t4g and m6g are pretty comparable by RAM and vCPU count, but m6g is the fixed-performance class and it is more pricey. The question for me was what to do with this. As you can see, t4g and m6g both give amazing performance, and you do not want a t3 for an RDS instance anymore, because both of them are cheaper and run faster, so just forget about t3. And if you master the burst credits, which are amazingly stable, you do not overuse them, you know your application's patterns, and your number of active users does not cause any issues, you can save up to 50-70 dollars by using t4g instead of m6g, even though m6g is the instance class AWS recommends for production. You can see some additional metrics here; I will just pause on this for a second. Sometimes Graviton instances deliver, for a lower price, twice the performance of the comparable x86 instances. As I mentioned, you can take m6g as a starting point for your RDS, or t4g.large if you are confident about your usage.

But all of these options do not resolve the most crucial thing I dislike about RDS: you have periods when it is idling, yet you are paying for a fully burning database that is sized and ready for lots of customers. It is basically like keeping the heating switched on in an apartment where nobody has lived for months: it is just a waste of money. And this is what Aurora resolves for us absolutely transparently. Thankfully we are talking about Aurora Serverless v2 here. I have some experience with Aurora Serverless v1; thankfully it is already on its way to legacy, it ran a few outdated engines, and you had to change the schema if you wanted to migrate to the fresher Serverless v2. Serverless v2 had an engine version matching what we already ran on RDS, and all we had to do was take a snapshot and change the endpoint of the database.

And you want to use a Route 53 hosted zone for the database endpoint, because you do not want to expose the raw RDS DNS endpoint everywhere; with a Route 53 private zone you can fix that. If you are migrating and trying out different databases, you can create something like, for example, mydatabase.conf42.com, and then you are set: when you create a new instance, only the DNS record changes, and you do not have to update your DB viewer connection, your docker-compose files, or the environment variables used to access the database. This is just a free tip, not Aurora-specific; it applies to databases on AWS in general, and a minimal sketch of the record update follows below.
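Here is roughly how that record update looks with Boto3; a minimal sketch, assuming a private hosted zone already exists. The zone ID, record name, and instance identifier are placeholders:

    # Minimal sketch: point a stable private DNS name at whatever database endpoint is current.
    import boto3

    route53 = boto3.client("route53")
    rds = boto3.client("rds")

    HOSTED_ZONE_ID = "Z0123456789EXAMPLE"     # placeholder private hosted zone
    RECORD_NAME = "db.internal.example.com."  # placeholder friendly name the app uses

    # Resolve the current endpoint of the instance we just migrated to.
    endpoint = rds.describe_db_instances(DBInstanceIdentifier="my-rds-instance")[
        "DBInstances"
    ][0]["Endpoint"]["Address"]

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Repoint the application-facing DB name after a migration",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": RECORD_NAME,
                        "Type": "CNAME",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": endpoint}],
                    },
                }
            ],
        },
    )

A short TTL, like the 60 seconds above, keeps the switchover quick when you repoint the name at a new instance.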
What Aurora resolves is that it eliminates the need for application re-engineering. You can have the main writer instance and a read replica, and Aurora does the load balancing for you. This is simply amazing: in our case, the application receives only a single endpoint. To picture it, the application comes to me and says: Dima, would you give me this information? I am pretty busy, so I just delegate the work to somebody else working with me and hand the result back, and the application never notices, because Aurora does exactly that by itself. You do not have to change your schema or anything; it just scales up and down when you need it. In our case it turned out to give the same performance at 35 percent less cost, and I am extremely happy with that, because I did not spend any developer time to change it. The migration went absolutely smoothly, because Aurora offers lots of MySQL engine versions, Postgres, whatever you would like, and you can even start building natively with Aurora's own engine.

Of course, you can go to the calculator, put in the peak capacity, and get some crazy values that are way bigger than your RDS bill, but you never run at peak capacity all the time, and this is exactly what Aurora resolves. For example, this is the price baseline for m6g, and this is the chart of a real working day of our application on Aurora. As you can see, in some time periods it goes above the baseline to deliver better performance than we would get with RDS, but those peaks are compensated by the idle time. What is important about Aurora, since it scales: if you are scaling pretty aggressively, for example starting with half a capacity unit and scaling up to four units or even more, you should not expect to have the full performance immediately. Aurora takes some time to wake up, so you will probably not hit the same rate of performance right away. At the peaks it is absolutely comparable to what you get with RDS, but you do not burn money when the database is not fully used.

And there is an amazing example by my colleague Joe Ho: he did a huge observability exercise over a few months with an Aurora Serverless cost calculation. Please jump in, and you can use his example in a presentation for your management and colleagues. He did a really big job, and I can confirm from my own experience that the data is valid; with the right approach, Aurora Serverless can be amazingly efficient.
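For reference, this is roughly what that migration and scaling setup look like through Boto3; a minimal sketch, assuming an aurora-mysql engine and a final snapshot of the old instance, with placeholder identifiers:

    # Minimal sketch: restore a snapshot into an Aurora cluster with Serverless v2 scaling,
    # then add a db.serverless writer and reader.
    import time
    import boto3

    rds = boto3.client("rds")

    CLUSTER_ID = "my-aurora-cluster"       # placeholder
    SNAPSHOT_ID = "my-rds-final-snapshot"  # placeholder snapshot of the old instance

    rds.restore_db_cluster_from_snapshot(
        DBClusterIdentifier=CLUSTER_ID,
        SnapshotIdentifier=SNAPSHOT_ID,
        Engine="aurora-mysql",
        # Scale between half a capacity unit when idle and four units at peak.
        ServerlessV2ScalingConfiguration={"MinCapacity": 0.5, "MaxCapacity": 4.0},
    )

    # Wait for the cluster before adding members.
    while rds.describe_db_clusters(DBClusterIdentifier=CLUSTER_ID)["DBClusters"][0]["Status"] != "available":
        time.sleep(30)

    # One writer and one reader; the cluster reader endpoint load-balances read traffic.
    for suffix in ("writer", "reader"):
        rds.create_db_instance(
            DBInstanceIdentifier=f"{CLUSTER_ID}-{suffix}",
            DBClusterIdentifier=CLUSTER_ID,
            DBInstanceClass="db.serverless",
            Engine="aurora-mysql",
        )

With the minimum at half a capacity unit, the idle periods from the chart cost almost nothing, while the maximum caps how far the peaks can climb.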
So, thankfully and sadly, we are coming to the end. In the whole process it is important to communicate with your team. You must understand why a decision was made in a certain way, and you must know the details, because we can talk about a database on AWS, on Azure, or on Oracle Cloud and it might look the same, but only on the surface, and a good cloud engineer should know the difference. You might have the same task, for example, as I mentioned, to get the application running, but it can be done in different ways, and the same goes for the details: gp2 and gp3 are both EBS volume types, but they behave differently, and the same applies to Aurora versus RDS. Details matter. And be creative: you are the engineer, you are the artist, you are the architect. Imagine if the whole world were built of panel houses. No, we have amazing buildings like St. Stephen's Cathedral, the Eiffel Tower, and Big Ben, because people are creative, and this is a part of our work too. We are not game designers, but we must be creative to deliver amazing solutions, whatever they are. And with AWS you can have everything covered and done, which is amazing; I am really excited to work with such a service, and it is really cool.

Thank you very much for your attention; I hope you have enjoyed the session. Please feel free to contact me on LinkedIn; I am always happy to have a discussion or share a tip with you. I can dive into your case and maybe we can discuss some specifics, and if you have a recommendation for me, I will be thankful to hear your opinion on what can be done better. On the left there is a QR code for my LinkedIn. Please also check out the other exciting speakers who are participating in this conference. Thank you very much to Mark for the invitation; it is a big pleasure for me to be here. You can also write me an email or visit my blog, which I am about to launch very soon.

To conclude, thank you very much to the AWS Community DACH for all the support. We are hosting an amazing event, the AWS Community Day DACH, which will take place in September this year; we are opening the registration very soon, and I will be happy to see you there so we can discuss your workloads in person. And if you are looking for the user group in Vienna, please check out our meetup page; I will be happy to see you there as well. That is all for today. Thank you very much, all the best, best of luck in your AWS deployments, and see you later in the clouds.
Dmytro Hlotenko

Cloud Engineer, Architect & CloudOps @ APA-IT Informations Technologie
