Conf42 Cloud Native 2021 - Online

Graviton2: How Honeycomb Reduced Infra Spend by 40% on Its Highest-Volume Service


Abstract

How did Honeycomb have the confidence to migrate to a more efficient processor architecture, resulting in a 40% reduction in AWS spend on their backend ingest service?

In this talk, Shelby Spees explores how Honeycomb’s platform engineering team measured the risk and ROI of moving their high-traffic ingest workloads to the new ARM-based Graviton2 processor architecture. Learn how high-cardinality observability data supported side-by-side comparisons of the same workloads on different processor architectures, and how to gain similar insights into your own infrastructure.

Summary

  • Honeycomb was one of the early adopters of AWS's Graviton2 instances in production. The move helped Honeycomb reduce its infrastructure spend by 40% on its highest-volume service. Shelby Spees explains how to make large migrations safe in your own systems.
  • Honeycomb moved to ARM64 for its Dogfood Shepherd service, and services on Graviton2 saw a significant improvement in tail latency under load. One snag early on was spot instance availability. A Kafka migration changed too many variables at once; Honeycomb learned from the experience and reverted that change.
  • Allowing time for changes to simmer and encouraging people to talk about them can be really helpful for your overall reliability and resilience. Honeycomb will be posting more about the migration soon.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to Adopting Graviton2: How Honeycomb Reduced Our Infrastructure Spend by 40% on Our Highest-Volume Service. I'm Shelby Spees, a developer advocate at Honeycomb, and you can find me on Twitter at shelbyspees. You might be here because you're considering migrating to Graviton2 in your environment. You don't want to jump into such a change just because it's the new shiny; you should have a measurable outcome that you're trying to achieve. At Honeycomb, we originally heard about these new instance types during Andy Jassy's keynote at re:Invent 2019. They were promising lower cost, better performance, and a reduced environmental impact. So how is Graviton2 different? The ARM-based architecture is more efficient. It runs cheaper because more of the physical CPU die is dedicated to just doing compute, and it consumes less power. It's also faster because ARM doesn't share execution units between threads across different virtual CPUs, which gives you less tail latency and less variability in performance. Long story short, we ended up making some waves as early adopters of AWS's Graviton2 instances in production. I'm here to share how we went about this adoption, what we learned, and what you can do to make large migrations safe in your own systems.

For Honeycomb, we wanted the cost and performance improvements of Graviton2, yes, but it would only make sense for us if we could get them without hurting our reliability. So first we had to ask: is it even worth the risk? To answer that, we needed to think about what's important to Honeycomb as a business. Honeycomb is a data storage engine and analytics tool. We ingest our customers' telemetry data, and then we enable fast querying on that data. At Honeycomb, we use service level objectives (SLOs), which represent a common language between engineering and business stakeholders. We define what success means according to the business and then measure it with our system telemetry throughout the lifecycle of a customer. That's how we know how well our services are doing, and that's how we measure the impact of changes. SLOs are for service behavior that has a customer impact. At Honeycomb, we want to ensure things like: the in-app homepage should load quickly with data, user-run queries should return results fast, and the customer data we're ingesting should get stored quickly and successfully. These are the sorts of things that our product managers and our customer support teams frequently talk to engineering about.

Once we have our SLO defined, that allows us to calculate our error budget: how many bad events are we allowed to have in our time window? Just like with a financial budget, the error budget gives us some wiggle room. Finally, we calculate how fast we're burning through that error budget. The steeper the burn rate, the sooner we're going to run out, so we start thinking: how soon are we going to hit zero? This allows us to alert proactively if something is causing us to burn through our budget so fast that we're going to run out in a matter of hours. That's when we should wake somebody up. Now we're alerting on things that we've already decided are important to the business, and we're doing it proactively. (There's a small worked example of this arithmetic after this section.)

Thanks to the work we put in to measure how our services were doing, we entered a period of pretty stable reliability. We're a small startup, so once our reliability goals are met, we care a lot about cost.
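As an aside, here's roughly what that error budget and burn-rate arithmetic looks like. This is a minimal Go sketch with made-up numbers; it isn't Honeycomb's SLO tooling, just an illustration of the calculation described above.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// All numbers here are made up for illustration; they are not Honeycomb's
	// real traffic figures.
	const sloTarget = 0.999 // 99.9% of ingest events stored successfully, over a 30-day window

	totalEvents := 1_000_000_000.0 // events expected over the whole window
	badEvents := 250_000.0         // failed or too-slow events seen so far
	elapsed := 12 * time.Hour      // how far into the window we are

	// Error budget: how many bad events the SLO allows in the window.
	budget := (1 - sloTarget) * totalEvents

	// Burn rate: bad events per hour at the current pace.
	burnPerHour := badEvents / elapsed.Hours()

	// Time until the budget is exhausted if the burn rate holds.
	hoursLeft := (budget - badEvents) / burnPerHour

	fmt.Printf("error budget: %.0f bad events\n", budget)
	fmt.Printf("burn rate: %.0f bad events/hour, budget empty in %.1f hours\n", burnPerHour, hoursLeft)
	if hoursLeft < 4 {
		fmt.Println("burning fast enough to wake somebody up")
	}
}
```

The point of the projection is that a burn rate slow enough to take weeks to exhaust the budget doesn't need to wake anyone up, while one that would empty it in a few hours does.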
Because we're a SaaS provider, our infrastructure is our number-two expenditure after people. Infrastructure scales with our production traffic, and since our sales goals involve landing large customer accounts, we want to position ourselves to support the traffic increase that comes with landing more big accounts. Technical decisions are business decisions: they affect our ability to maneuver in the market. So, having defined our goal and confirmed our ability to measure it, we needed to decide how to proceed. How can we safely experiment with an entirely different processor architecture?

At Honeycomb, we have multiple environments. We deploy the same code, the same git hash, and it runs across all of those environments. Our production environment is where customers send their data in order to observe their own systems. We have a second environment called Dogfood, where we send the telemetry from production Honeycomb. It's not a staging environment; it runs the exact same code as production. Dogfood allows us to query our production data the way Honeycomb's users interact with their production data. It's a great way for Honeycombers to build user empathy, and it also allows us to test experimental features on a safe audience, so we'll often enable feature flags in Dogfood for internal testing. Then we have a third environment called Kibble, which is where we send our telemetry from Dogfood, including the telemetry for those experimental features. So Kibble observes Dogfood, and Dogfood observes prod.

One thing to know about Honeycomb: on the outside we're bee-themed, but you might get the sense that on the inside we have a lot of dogs. We have a number of different services, and the biggest ones include Shepherd, our ingest API service; Retriever, our columnar storage and query engine; and Poodle, the front-end web application. So we really stuck with the theme. For Graviton2, we chose to try things out on Shepherd because it's our highest-traffic service, but it's also relatively straightforward: it's stateless, it only scales on CPU, and as a service it's optimized for throughput first and then latency. So we have a place to start. What's next? We needed new base images, we needed to check that our application code was compatible, and we needed to make sure that our existing CI would produce build artifacts for ARM64. Honeycomb is a Go shop, so it turned out we just needed to set ARM64 as a compilation target, and the Go compiler can handle that for us even when it's not compiling on an ARM box. We updated our CircleCI config to include a build step for the ARM64 target. I will say, though, you do need an ARM machine to efficiently build ARM-compatible Docker images. At Honeycomb we don't use Docker for any production workloads, just for internal branch deploys, so that wasn't an issue for us. For other shops: if you're on Java or Python, your binaries are already architecture-independent, but if you're running, say, C with some hand assembly, you might need to update a few things. My teammate Liz initially started all this experimenting as a side project, and with a few idle afternoons of work she was already seeing compelling results. So at that point we set up A/B testing.
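One thing that makes this kind of side-by-side comparison straightforward is that each event can carry the architecture it was served from as just another field. Here's a hedged sketch of what that might look like with libhoney-go; the dataset name, field names, and the HONEYCOMB_WRITE_KEY environment variable are hypothetical, and this isn't Honeycomb's actual Shepherd instrumentation.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"time"

	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	// Hypothetical dataset name and env var; wire these up however your
	// service actually manages configuration.
	err := libhoney.Init(libhoney.Config{
		WriteKey: os.Getenv("HONEYCOMB_WRITE_KEY"),
		Dataset:  "ingest-canary",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer libhoney.Close()

	start := time.Now()
	// ... handle one ingest request here ...

	ev := libhoney.NewEvent()
	ev.AddField("name", "handle_ingest")
	ev.AddField("duration_ms", float64(time.Since(start).Microseconds())/1000.0)
	// Tag every event with the architecture the binary was built for, so
	// queries can group by host.arch and compare arm64 vs amd64 side by side.
	ev.AddField("host.arch", runtime.GOARCH)
	if err := ev.Send(); err != nil {
		log.Println("send failed:", err)
	}
}
```

With an attribute like host.arch on every event, grouping by it in a query gives you a per-architecture latency comparison without needing separate dashboards per instance type.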
We started with one Graviton2 instance in a Shepherd autoscaling group in Dogfood, and from there we could update our Terraform code to spin up more Graviton2 instances as we grew confident in each change. When we reached 20%, we let that sit for a couple of weeks to observe, and here's what we found. Our Graviton2 instances, here in the second row, saw lower latency overall as well as less tail latency. The median latency was more stable across different workloads, and about 10% faster than on the old architecture. On the old architecture, CPU utilization would max out around 60%; on Graviton2 we would get closer to 85%. So we got better utilization per CPU unit.

Zooming out a bit, here's the overall migration in Dogfood Shepherd. These graphs show our total vCPU usage at the top and the number of Dogfood Shepherd hosts at the bottom. You can see the big cutover in mid-August, and from there we did some tuning to figure out the sweet spot where we're getting the best mileage out of those CPUs. This is my favorite graph so far: our cost reduction in Dogfood. For our Dogfood Shepherd service we saw pretty compelling results, and at that point we decided we were ready to roll out to production.

So what happened next? We felt confident about Shepherd, so we migrated prod Shepherd and saw a similar cost reduction. For Retriever, we didn't care so much about reducing costs; what we wanted to improve was performance, because we care about fast querying. It turns out we could opt to spend a little bit more to get double the number of cores, and since each ARM64 core is able to handle 50% more load than an equivalent Intel core, we ended up with triple the performance. Once we were already all-in on Graviton2, migrating Retriever was a no-brainer, and Retriever immediately saw a significant improvement in tail latency under load. Those weekday bumps just totally flatten out on the P99 graph (there's a toy sketch of this kind of per-architecture percentile comparison after this section). Zooming out again, our traffic volume has increased significantly over this past year. We're approaching triple the workload on Retriever compared to when we started, but our tail latency is holding steady, which is fantastic.

One snag we did encounter early on was spot instance availability. When we started scaling up in prod, we ended up using all the m6gd instances available in spot, so we paused our migration. Liz reached out to the Graviton2 team, and they were able to shift capacity for us within a few hours, so then we were back in business.

Another thing that happened involved our Kafka cluster. Kafka sits between Shepherd and Retriever; it allows us to decouple those two services and replay event streams for ingest. We were testing Kafka on Graviton2 so early that we were trying it before even Confluent had tried it on the new architecture, and we're probably the first to use it for production workloads. And we ended up changing too many variables at once: we wanted to move to tiered storage on Confluent Kafka to reduce the number of instances we were running, we tried the architecture switch at the same time, and we introduced AWS Nitro on top of that. All of those variables at once was a mistake. We've published a blog post on this experience, as well as a full incident report, and I highly recommend that you go read them to better understand the decisions we made and what we learned. So we've reverted the Kafka change.
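For a feel of the tail-latency comparison mentioned above, here's a toy Go sketch that computes P50 and P99 per architecture from a handful of made-up latency samples. The real comparisons were Honeycomb queries over live traffic; the numbers and the nearest-rank percentile method here are purely illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (0-100) of the given latencies using
// the nearest-rank method; good enough for a toy comparison, not for real analytics.
func percentile(latencies []float64, p float64) float64 {
	sorted := append([]float64(nil), latencies...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	// Made-up latency samples in milliseconds, keyed by host architecture.
	samples := map[string][]float64{
		"amd64": {4.1, 4.3, 5.0, 5.2, 6.8, 7.5, 9.9, 14.2, 22.0, 41.0},
		"arm64": {3.9, 4.0, 4.4, 4.6, 5.1, 5.3, 6.0, 7.2, 8.8, 12.5},
	}
	for arch, lat := range samples {
		fmt.Printf("%s: p50=%.1fms p99=%.1fms\n", arch, percentile(lat, 50), percentile(lat, 99))
	}
}
```

The point is simply that two distributions with similar medians can have very different P99s, which is why we watched tail latency rather than averages alone.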
Beyond that, we also have a long tail of individual hosts and smaller clusters that we'd like to migrate, but four of our five biggest services are fully on Graviton2, and here's what that looks like. Those services make up the vast majority of our traffic, and we're really thrilled with the cost savings and the performance improvements. Plus, it feels great to be able to say that we've reduced our environmental impact as well.

Here are some things I hope you can take with you. The most important thing to remember when considering a significant technology migration is to have a goal in mind, something measurable, so you know whether or not your change was successful. You need to be able to compare your experiments to a baseline, and SLOs are a really great way to approach this.

Another thing to keep in mind is that there are always hidden risks. We're lucky to have Liz's expertise and sense of ownership, and I think that's really important; part of being an early adopter is making sure you have that expertise in-house. But we still ran into snags, like Amazon running out of Graviton2 spot instances. We're lucky that we were able to make friends with the team and talk to them, but it does add more variables and potential silos. We had a lot of luck with Terraform Cloud and CircleCI. They smoothed out a lot of the experimentation that would normally be manual clicking in the console, so we could point to individual changes and figure out what to revert.

All of these hidden risks have a human impact, and in general it's important to take care of your people. Incidents happened, and we're lucky that we had existing practices that helped a lot. We encourage people to escalate when they need a break or when they're starting to feel tired. We remind, or sometimes guilt, people into taking time off work to make up for off-hours incident response. Another thing that came up recently is that people responding to incidents couldn't cook dinner for themselves or their families. Once somebody said it, it was almost a no-brainer: of course people should be able to expense meals for themselves and their families when they're doing incident response. So we made an official policy about that. In general, I think it's good to document and make official policy out of things that are often unspoken agreements or assumptions, so that everyone on your team can benefit and feel clear about those decisions. One of our values at Honeycomb is that we hire adults, and adults have responsibilities outside of work. You're not going to build a sustainable, healthy sociotechnical system, a sustainable, healthy team, if you don't account for those responsibilities. Take care of your people.

Finally, optimize for safety. Ensure that people don't feel rushed, and remember that complexity multiplies, so do whatever you can to isolate variables and create tight feedback loops. Even then, things are going to intersect in unexpected ways; complex systems fail in really unexpected ways. Acknowledge that sometimes things will take longer than you'd like. We plan to eventually migrate Kafka over to Graviton2, but we didn't do it in the time range we wanted to, and that's okay. Another benefit is that isolating variables makes it easier for people to update their mental models.
As changes go out, it really helps to get everyone talking to each other. Allowing time for things to simmer and encouraging people to talk about these changes can be really helpful for your overall reliability and resilience. If you'd like to read our Graviton2 posts on the Honeycomb blog, here are a couple of links, and we'll be posting more about it soon. You can download slides at honeycomb.io/shelby. Also, I'd love it if you reached out to me on Twitter. That's all I have for today. Thank you so much.
...

Shelby Spees

Developer Advocate @ Honeycomb.io



