Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

Cloud Compute Cost Showdown: How We Saved Millions Optimizing AWS EMR and Databricks Workloads


Abstract

Cloud costs spiraling out of control? In this talk, we'll break down how we saved millions optimizing AWS EMR and Databricks workloads without sacrificing performance. Learn real-world strategies, cost benchmarking insights, and best practices to maximize efficiency and slash cloud spend. Don't miss it!


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, everyone. I'm Seshendranath Balla Venkata. I'm a senior manager at Comcast, where I build data products and take care of data engineering. Today's talk is about the cloud compute cost showdown: how we saved millions of OPEX dollars optimizing AWS EMR and Databricks workloads. So let's dive into the talk.

The agenda for today: we'll dive into what Databricks is and what the Databricks cost optimizations are. Then we'll cover what AWS EMR is, what different AWS EMR features and offerings are available in the market, and what the AWS EMR cost optimizations are.

So what is Databricks? Databricks is a cloud-based unified analytics platform built on top of Apache Spark. Databricks brings ease of use on top of Apache Spark: it adds a lot of UI capabilities on top of it, and it can be leveraged for data engineering, machine learning, and big data processing. The key features of Databricks: first, optimized Apache Spark. Compared to open source Apache Spark, the Databricks version of Spark is more optimized; you'll see fewer errors running huge workloads, and you won't see workers dying in the middle of a run on Databricks. It also offers a feature called Delta Lake, which provides ACID compliance on your data sets; it's more of an OLTP-style offering from Databricks. Whenever a change happens in your system, Delta captures it and maintains the metadata for it, so you can pull the transaction information, and it supports ACID (I'll show a small example of this in a moment). It supports serverless compute as well, which is a newer feature they launched. And as I said, Databricks can run on any cloud provider, like AWS, Google Cloud, or Azure. Right now those are the three offerings; Databricks provisions on top of them and you process the data for your compute and analytical needs. It offers collaborative notebooks where you can go write your notebooks and deploy them, whether for data engineering, data science, or ML workflows. As I said, it's multi-cloud: AWS, Azure, or Google Cloud. Those are the key features. If you look at the picture at the bottom, Databricks handles governance across all these cloud providers, and it provides business intelligence and data warehousing solutions, AI/ML and data science solutions, ETL, real-time data processing, and orchestration within the tool.

Why would we use Databricks, right? That's a good question. One of the biggest advantages with Databricks is cost optimization: you run your workloads using Databricks and you don't have to worry about provisioning clusters or where to get the resources. You simply focus on your business use case, and Databricks takes care of orchestration and provisioning of those resources. It's also high performance. As I said, it's an optimized version of Spark, and it has an engine called the Photon engine, which accelerates SQL and Spark operations for particular use cases. Not all use cases are supported by Photon, but for the ones that are, the performance gain over general Spark is really extensive. It's written in C++, so it manages memory really well.
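Here is a minimal sketch of the Delta Lake behavior described above, as you might run it in a Databricks notebook (where `spark` is predefined). The table name `demo.events`, the `demo` schema, and the column values are assumptions made up for illustration, not from the talk.

```python
# Minimal Delta Lake sketch: versioned, ACID writes plus time travel.
# Assumes a Databricks notebook where `spark` is already defined and a
# `demo` schema exists; names are illustrative only.
from pyspark.sql import functions as F

events = spark.range(1000).withColumn("status", F.lit("new"))

# The initial write creates version 0 of the Delta table.
events.write.format("delta").mode("overwrite").saveAsTable("demo.events")

# An UPDATE is committed atomically as a new version in the transaction log.
spark.sql("UPDATE demo.events SET status = 'processed' WHERE id < 100")

# Inspect the change history Delta keeps for the table.
spark.sql("DESCRIBE HISTORY demo.events").show(truncate=False)

# Time travel: query the table as it looked at version 0.
spark.sql("SELECT count(*) FROM demo.events VERSION AS OF 0 WHERE status = 'new'").show()
```

The cost angle is the transaction log itself: every change is tracked as a new version, which is also why the cleanup and retention topics later in the talk matter.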
Then, it's collaboration friendly: you write your notebooks and share them across engineers, you can develop together in real time, and you get seamless integrations with all these cloud providers. So that's, at a high level, what Databricks is offering right now.

Now let's look at Databricks optimization. It has multiple layers; it's not one simple point where you optimize, there are multiple things we have to do.

The first one is to review EBS usage. So what is EBS, right? Whenever you launch your node or cluster and the data being processed doesn't fit in memory, Spark starts writing to disk. Most node types come with internal disk; if that's not there, they'll use AWS EBS storage, and most of the time EBS is the default storage attached to the nodes. Disk storage is used for shuffling, as I said, so checking the disk allocated to your instance type is very important. i3.4xlarge workers have 3.6 TB of local storage each, so we don't have to use EBS volumes at all if we're on i3 nodes. But people still go and provision EBS volumes, and then you're paying additional cost for them. The EBS cost for a job is the storage cost plus I/O cost plus snapshot cost plus data transfer cost; all of that comes with EBS. It might look very small initially, but when you're running a huge number of jobs and paying in the millions, this contributes a lot. When you tweak this, you can save 50 to 80% of that portion of the spend.

Next, understand the autoscaling pitfalls; saving up to 28% is possible here. Basically, whenever we have fluctuation in the data and it's seasonal, we don't know how much data we'll get, so normally we go and enable autoscaling. But whenever a job autoscales, whether it's scaling up or scaling down, it keeps calling AWS Config, and those config calls have a cost. So don't use autoscaling unless it is really required. If you know the history of the job, keep it simple: keep the resources static, not dynamic. That way you save on the autoscaling.

Reduce AWS Config cost. This is a subsection of autoscaling. As I said previously, whenever Databricks keeps calling AWS Config to scale your resources up or down, whether that's EBS volumes, adding nodes, or removing nodes, it generates recurring cost, and it piles onto the job.

So the next point: always use the right, appropriate instance. Move the appropriate workloads from M nodes to I nodes, where you get internal volume; that helps you avoid the EBS volume on top, and you can work out the cost breakdown for an M node versus an I node. Choosing the right instance is the key here; we miss a lot of savings by choosing the wrong instance. For example, m5.4xlarge versus i3.4xlarge: each has its own profile, and you need to understand why you're choosing an instance rather than simply going blindly with the default recommendation. So we need a really good understanding when choosing an instance.
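To make the last few points concrete, here is a hedged sketch of a cluster spec that uses i3.4xlarge workers (local NVMe, so no extra EBS volumes are attached) and a static worker count instead of an autoscale block. The workspace URL, token, runtime label, and worker count are placeholders, and the field names follow the Databricks Clusters API as I recall it, so verify them against the current docs.

```python
# Hypothetical cluster spec avoiding the EBS and autoscaling pitfalls above:
# i3.4xlarge workers (3.6 TB local NVMe, so no aws_attributes/ebs_volume_*
# settings are added) and a fixed num_workers instead of an autoscale block.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

cluster_spec = {
    "cluster_name": "nightly-etl-static",
    "spark_version": "14.3.x-scala2.12",  # example runtime label
    "node_type_id": "i3.4xlarge",         # instance storage instead of EBS
    "num_workers": 8,                     # static sizing based on job history
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())
```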
The last one is spot versus on-demand. It depends on the workload: if you're running a higher-SLA job, run it on-demand. If you have a lower-SLA job, bank on a cheaper spot instance, where you go and bid for the instance and you get it. The chance that AWS terminates it when a bigger bid comes in is there, but if you have a lower-SLA job that's resilient to a restart, even if you have to run it one more time you still save money. In practice it happens very rarely that someone comes and grabs your resources, and you can save a lot of money on spot.

There are a few other optimization things we can do. First, use the right number of shuffle partitions. The key with any data workload is that the shuffle, the amount of data written to disk, and the I/O over the network should be minimal. When you're doing huge aggregations like group by, joins, or reduceByKey, the key should be distributed across all the nodes so the power of distributed computing is utilized to the fullest. So the partitioning is very important, and how much gets shuffled is very important. Most engineers know about this, but most of the time they overlook it. Look at the UI and see how much data is being spilled and how much data is being shuffled over the network; Spark has a really good UI to showcase all of that. The usual recommendation is roughly two times the number of cores; keep that as your shuffle partition count, so the work is spread across all the nodes.

Then caching. People overuse caching even when it's not required. Spark builds a DAG and recomputes lineage, so if you cache an intermediate result, it won't recompute the entire DAG when it reads that data again. Look at the UI to understand how much activity is happening with the garbage collector and how much cache is actually required, and use cache properly.

Then skew from duplicate keys is one of the biggest problems: you may see all the data going into one node, or the key may not be unique. There are multiple ways to handle it: you can deduplicate before you process, or pull all the duplicates into one data frame, process everything else, and union them back. One technique is called salting, where you take the key and add a random value to it, so the key is distributed across all the nodes and can be processed in parallel. There are a lot of ways to eliminate the duplicates and avoid the cost of these jobs running for a long time.

And the last one is to establish a robust data life cycle and cost system: define life cycle policies on these data sets, how long these jobs need to run, and so on. That's very important, and you need to keep monitoring these jobs. Sometimes we assume a job costs a certain amount because autoscaling is enabled, and we never look at the nitty-gritty of the job's cost breakdown. So look at all of that and keep an alert on it: if it crosses a threshold, fire the alert and start looking at why the job is behaving like that.
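Here is a rough sketch of the salting technique described above, plus the roughly-2x-cores shuffle partition rule of thumb. The table name, key column, bucket count, and core count are assumptions for illustration, not from the talk.

```python
# Salting a skewed aggregation key, plus the ~2x-cores shuffle partition
# rule of thumb. `demo.clicks`, `user_id`, and the core count are made up.
from pyspark.sql import functions as F

# e.g. 12 workers x 8 cores = 96 cores -> ~192 shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", 192)

df = spark.table("demo.clicks")
SALT_BUCKETS = 16

# Stage 1: spread each hot key over SALT_BUCKETS partial keys so no single
# node receives all rows for that key.
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
partial = salted.groupBy("user_id", "salt").agg(F.count("*").alias("cnt"))

# Stage 2: collapse the partial aggregates back to one row per key.
result = partial.groupBy("user_id").agg(F.sum("cnt").alias("clicks"))
result.show(5)
```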
So now let's dive into what AWS EMR is. AWS EMR is a cloud-based big data processing offering from AWS where you can run Spark-based or Hive-based jobs. It goes back to when Hadoop started: they released EMR, Elastic MapReduce, where you could launch your MapReduce or Hive jobs, and it would run on EC2 instances. It evolved over time, and a lot of companies and teams use it for different purposes.

There are three flavors of AWS EMR. I'll talk about the differences in a minute, but first I want to explain what the three flavors, or three offerings, are. One is the legacy EMR on EC2. That's for when you have large-scale workloads and you need to process all of that on EC2, but the cluster management is very tedious; provisioning, scaling, and monitoring are hard if you're doing it with EMR on EC2. The second offering is EMR on EKS (Elastic Kubernetes Service), where you run all these jobs on Kubernetes. The problem with this one is that it's more complex to set up the entire infrastructure, but scaling up, managing the workflows, and running the workflows can be done at scale. The third one is EMR Serverless. This is a newer offering that started hitting the market in the last couple of years, where you get on-demand, pay-for-usage processing. The only problem with serverless EMR is that you don't have much control; if you want to tweak something or change some configuration, it's a little complex. Apart from that, serverless EMR is a game changer right now.

If you look at the differences: cluster control on EMR on EC2 is high, on EKS it's medium, and on serverless it's pretty much low, since we don't have much control there. Scaling-wise, EMR on EC2 is manual (you can automate it), on EKS you can automate it, and serverless EMR is fully automated. Best fit: EMR on EC2 for large-scale ETL workloads, EMR on EKS for ML and containerized jobs, and serverless EMR for ad hoc or bursty jobs. Cost efficiency is where the key savings are: most of the workloads we migrated from EMR on EC2 to serverless EMR are where we saw a lot of cost savings. You pay for what you use with the other two as well, but with serverless you don't have to worry about it; it pretty much follows the Databricks model, where you don't worry about provisioning. You just say what you want, focus on your business use case, and AWS takes care of it. Ease of use: serverless EMR wins the game; EKS is more complex, but once the infrastructure is provisioned it works wonderfully; and EMR on EC2 is the traditional option, somewhere between moderate and hard on ease of use.
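Since serverless EMR came up as the biggest saver for bursty, lower-SLA jobs, here is a hedged sketch of submitting a Spark job to EMR Serverless with boto3. The release label, IAM role, and S3 paths are placeholders, and the call shapes follow the emr-serverless client as I recall it, so check the boto3 docs before relying on this.

```python
# Sketch: run a PySpark script on EMR Serverless (pay only while the job runs,
# no cluster to keep alive). Role ARN, bucket, and release label are placeholders.
import boto3

emr = boto3.client("emr-serverless", region_name="us-east-1")

# One-time setup: a Spark application that EMR Serverless scales per job run.
app = emr.create_application(
    name="adhoc-spark",
    releaseLabel="emr-7.1.0",
    type="SPARK",
)

# In practice you would wait for the application to finish creating/starting
# before submitting; omitted here for brevity.
run = emr.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::<account-id>:role/<emr-serverless-job-role>",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://<bucket>/jobs/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print(run["jobRunId"])
```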
Now, how can we save with AWS EMR cost optimization? First, right-size the cluster based upon the workload patterns. It's the same idea we talked about with Databricks: you need to choose the right worker type and node to provision your jobs. And one of the biggest things is Graviton-based instances, which reduce cost by around 20%. That's a game changer as well, because Graviton is a chip built by Amazon, so they don't have to pay anything for Intel-based chips. Graviton runs much faster, AWS doesn't pay anything on top of it, and they pass the savings on to customers.

Second, dynamic autoscaling for efficient resource allocation. As with Databricks, we should be very careful with autoscaling where our jobs are being processed; use it only when it makes sense. If you don't need it, keep it normal, keep it static, and don't enable autoscaling.

The third one is optimizing Spark performance. As I said previously, shuffle is very important: avoid doing a lot of I/O over the network or spilling data to disk that you then have to read back, because that kills the whole point of Apache Spark. So look at that. Dynamic allocation of Spark executors needs to be implemented to ensure optimal usage per job, and parallelism settings need to be tuned.

Then, serverless EMR versus EKS compute. What we have seen is that you should pick based on the job type: if you have a job that's lower in SLA and it's fine if it fails and restarts, that's a good candidate. Based on that, you can select which kind of EMR to choose. But we have seen serverless EMR save a lot when you tune it and run it on Graviton chips, so we recommend trying serverless EMR in that situation.

Then, data storage and access optimization. When you read from and write to S3, there is a cost to it. And for ACID, similar to how Databricks offers Delta, AWS offers ACID-compliant table formats like Hudi or Iceberg. Optimize those to the fullest, because with an ACID table format the files keep growing whenever a transaction happens on the table. So we need to vacuum it, clean it, and keep only a certain number of versions so it doesn't create a huge amount of data. That's again a game changer. Adjusting the retention on these S3 objects, rather than just leaving them alone, gives huge cost savings.

Then, automating cluster shutdowns and scheduling. We have seen this issue with AWS EMR in particular, where a cluster stays up and never gets shut down, or keeps running even though there's no workload on it. As I said, keep alerts on all these workloads: if something crosses a certain threshold, it should alert, and we should debug and see what's really going on with the SLA and what's going on in the TCO reports. We need to look at that to identify what's happening there. (There's a small sketch of the retention and auto-shutdown setup right after this wrap-up.)

That being said, I hope this talk gives you some idea of how you can optimize your cloud workloads. And if you have any questions, shoot me
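To make the last two points concrete, here is a small boto3 sketch: an S3 lifecycle rule that expires old staging data, and an idle auto-termination policy on an EMR-on-EC2 cluster. The bucket name, prefix, retention window, and cluster ID are all placeholder assumptions, not values from the talk.

```python
# Sketch of the retention and auto-shutdown ideas from the talk.
# Bucket name, prefix, day counts, and cluster ID are placeholders.
import boto3

# 1) Don't keep intermediate S3 data forever: expire it after 30 days.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="<my-data-bucket>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-staging-after-30-days",
                "Filter": {"Prefix": "staging/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)

# 2) Don't leave EMR-on-EC2 clusters idling: terminate after 1 hour idle.
emr = boto3.client("emr")
emr.put_auto_termination_policy(
    ClusterId="<j-XXXXXXXXXXXXX>",
    AutoTerminationPolicy={"IdleTimeout": 3600},  # seconds
)
```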
...

Seshendranath Balla Venkata

Senior Manager, Data Engineering @ Comcast

Seshendranath Balla Venkata's LinkedIn account


