Reliability Nirvana

Abstract

You’ve heard of event-driven architectures but what do they actually look like?

Learn what is involved in building and operating an event-driven distributed system and why your next high-scale project should make use of this exciting tech.

Summary

Dan talks about how to achieve reliability nirvana by utilizing event-driven design. Everything he covers he has done in production; none of it is theoretical. His personal goal for the talk is for you to walk away having learned something.
Dan loves building and operating distributed systems. Most recently, he co-founded a company called Batch, which focuses on providing observability for data that is usually found on high-throughput systems such as Kafka or RabbitMQ. He is originally from Latvia.
We want to have predictable service failure scenarios. We also want some sort of self-healing at the service level. But if you want to go a little bit higher, it gets exponentially harder. It's akin to moving from VMs over to containers for the first time.
Wikipedia defines event driven as a software and systems architecture paradigm that promotes the production, detection, consumption of, and reaction to events. The event bus of Dan's choice would be RabbitMQ 100% of the time (well, 95% of the time).
In an event-driven system you no longer rely on any other services around you. Each service only cares about itself, which means that your failure domain is really small. You get a well-defined development workflow, and security is going to be super happy about it.
Of the hardest parts, the most important one by far is having some sort of event archiving solution. A brand-new platform is an amazing time to implement event driven, but only if you have a complete understanding of everything. With an existing org, do not try to do it in one fell swoop.
SRE is by far the most important part of the conversation. The only folks who truly understand how things work at a platform level are SREs. You absolutely must have a written culture in place with an event-driven system: everything should be written down. Security will totally thank you.
You should start off with Martin Fowler's article on event driven. It explains that event driven actually contains multiple different architectures, and it's a good overview of event driven in general. To get there, you should probably spend a little more time researching and reading on this topic.
Do not rely on ordering. Create your system so that order doesn't really matter; that's where idempotency comes into play. High-throughput observability is not really a thing yet. If you are thinking about going event driven, come talk to Dan.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi folks, I am Dan, and today I'm going to be talking to you about how to achieve reliability nirvana. And we will do that by utilizing event-driven design. So let's get to it. First things first, let me just give you a quick disclaimer. Everything that I talk about I have done in production; none of this is theoretical. This is stuff that I have personally done. And if I haven't done some of it in production, or maybe I've just done it in staging, I will let you know that that's the case. I also try to keep it real, meaning that if there is something that is really difficult, I will let you know that it's actually difficult. And if some of the stuff that looks difficult is actually easy, I'll let you know about that, too. Really, the final piece is that my own personal goal for this talk is that I just want you to walk away having learned something, right? Just have something tangible at the end of the talk. So with that said, let me talk about myself first, because every single talk is supposed to have at least one slide about the presenter. So let me be a little self-indulgent here. I am Dan. I reside in Portland, Oregon. I have worked in the back end for about ten years, probably more at this point; I've been saying ten years for a while. I love building and operating distributed systems. I figured out a while ago that I'm really into the architecture and design portions of distributed systems. It's really fun and really interesting. I was previously an SRE at New Relic. I was an SE at InVision, DigitalOcean, and Community. I also spent a lot of time in data centers as well, basically wiring stuff up, well, gluing things together, really, between systems and software.
And most recently, I co-founded a company called Batch, and we focus on providing observability for data that is usually found on high-throughput systems such as Kafka or RabbitMQ, et cetera, essentially message brokers. A cool fact is that we got into Y Combinator, which is pretty sweet. And one fact that you can immediately take away with you is that I am originally from Latvia, and there are at least literally dozens of us that like distributed systems and are from Latvia. So there you go. You've already got one thing you know. All right, so what is reliability nirvana? Well, it is not being woken up at 3:00 a.m. on Saturday night, or really any night at 3:00 a.m. I do not want to be woken up at all. We want to have predictable service failure scenarios. We want to have well-defined service boundaries. We want to be security conscious. We want to have self-healing services, and we want to be highly scalable and highly reliable, of course. Now you might be saying, well, we already have this, whatever tech. And yeah, you're right, we do. We have the microservices pattern, which deals with the monolith problem that we've had, by being able to decouple things and basically slice them up. And after we slice them up, we slice them into containers, which is perfect. We're now able to have reproducible builds. We have a really nice dev flow because everything is in a container now. We can make use of container orchestration, such as Kubernetes or Mesos and so on, to tackle all the concerns around lifecycles, right, the lifecycle of a service. And then we have things like service meshes, which ensure that our services are able to talk to each other correctly and so on. And there's probably a whole slew of other things that you might have that are going to help you achieve even better reliability as well, such as APM solutions like New Relic, Datadog, and Honeycomb, logging platforms, and so on.
There's really a ton of stuff, like tracing, right? Like request tracing and so on. Things have gotten pretty good overall, really. Right. So what is the problem if things are already pretty great? Well, the issue is that if you want to go a little bit higher, if you want to take the next step and achieve even higher reliability, it gets exponentially harder. Case in point, if we're just talking about the microservices pattern itself, if you have, I don't know, 20 or 30 microservices or something like that, you have a massive failure domain. You actually have no idea how big your failure domain is, because one service might take down six other services and cause three other services to become really, really slow and so on. So the usual approach to solving this sort of thing is to say, well, we're just going to employ circuit breakers everywhere. But as it turns out, with circuit breakers, it is really easy to shoot yourself in the foot as well. They're notoriously difficult to configure and get right. It's not that circuit breakers are difficult to actually implement; it's that getting them right is really hard. For those of you not familiar with Hystrix-style circuit breakers, they're essentially code patterns to introduce fault tolerance into your requests, right? So when you're making an HTTP request, if it fails, then it's going to be retried automatically for you, however many times, with an exponential backoff and so on. In my experience, it is very easy to not get it right. And in some cases the shooting-yourself-in-the-foot part is misconfiguring it in such a way that it trips when it's not supposed to. So when the service is not actually down, something will happen that causes it to trip, and thus real requests are going to get dropped.
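To make the misconfiguration risk concrete, here is a minimal circuit breaker sketch in Python. It is only an illustration, not Hystrix and not anything shown in the talk: the class name, `failure_threshold`, `reset_after`, and the injectable `clock` are all hypothetical names chosen for this example, and retries with backoff are left out to keep it short.

```python
import time


class CircuitBreaker:
    """Tiny illustrative circuit breaker; names and thresholds are hypothetical."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_after = reset_after              # seconds to stay open before a trial call
        self.clock = clock                          # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # While open, requests are rejected without touching the service.
                raise RuntimeError("circuit open: request rejected")
            # Cool-down elapsed: allow one trial request (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

With a threshold that is too low, a couple of transient errors from a healthy service are enough to open the circuit, and every request during the cool-down is then rejected. That is the kind of tripping when it is not supposed to that the talk warns about.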
And at the same time, you also have to figure out how to avoid cascading failures. But the good news is that there's always somebody who is going to say, well, my service knows how to deal with that, and it's going to prevent everything else after it from failing in a cascading fashion, which is of course totally not true. And whoever that is, their service is actually maybe not that great. At the same time, we also want some sort of self-healing at the service level, meaning that just because we have autoscaling in place, what happens to that request which was mid-flight in the middle of the autoscale event? Does it just get dropped? And are we simply okay with the 0.2% failure rate that we're having because of a deployment or something like that? So that by itself is already hard to achieve as well. And at the same time, you need to keep security in mind. And as we all know, PMs just absolutely love it when you are spending time not shipping features and just working on something that may seem to them kind of useless. So the point is that getting to that next level is really hard. It's akin to moving from VMs over to containers for the first time. It's a fairly big deal at that point. There's unfortunately no silver bullet which is just going to automatically solve all these problems. So what you might be seeing here is that a pattern is starting to emerge. We're really talking about services. It seems like the infrastructure part and the platform components and so on are actually fine. There's nothing really inherently wrong with them. As a matter of fact, the microservice pattern is great. Kubernetes is great, Docker is great. All these things are really fantastic. So the place where we need to put the focus is on the services themselves. So how would we achieve that?
Well, we would want to have some sort of a solution which ensures that the services don't rely on each other anymore, meaning that service A does not have to talk to service B. As a matter of fact, none of them should do that at all. Right. We would also want a situation where developers do not have to write code specifically to deal with a service coming back after it has dropped a bunch of requests, right, not having to put these sorts of fairly complex fault tolerance systems in place. Similarly, it would be really awesome if SREs did not have to write any sort of per-service firewall rules, punching holes in different places whenever service E needs to talk to service X and so on. It would be nice to just be able to put a blanket rule down and have it simply work, right? And then on top of that, it doesn't even have to be touched again at some point in time. A nice bonus to all of this would be that we would be able to investigate every single state change that takes place as part of a single request. To some extent we can already do that, because if we have request tracing and so on, you could potentially do that. But again, it's one of those solutions where, well, not everyone has request tracing. It would be nice if something basically came for free, so that we were just able to get it. And finally, a really big one is that it would be super awesome if all this really concrete, very specific systems data that is flowing through, which is representative of the current system state, would ultimately be available for your future analytics uses. Meaning, maybe you don't have a data science team right now, but maybe you will have one in about six months or something like that, and you would be able to show them that, hey, you can hook into here and you can see all the messages that are there.
Everything that's ever transpired on our systems. That would be super awesome, because they could then build various systems to analyze the data, predict things and so on, and create dashboards and whatever other cool stuff that data science teams do. I'm not a data scientist, as it turns out. This is kind of what I do instead. So what you might be noticing here is that reliability nirvana is actually not just service reliability or systems reliability. It's actually effortless service reliability. We want something that is able to go the next step, where I do not have to spend time building something that is for a singular purpose only, such as implementing some sort of a circuit breaker pattern. Well, that gets me closer to better reliability, but it's not the be-all and end-all. It is just one piece of the puzzle. And what you might be seeing here is that what I'm going towards is that there kind of is a solution which is able to address all of those things. It is not easy by any means. However, it does exist and it is totally doable. You can actually implement it. But there are certainly some caveats, and that's the sort of stuff we're going to get into. But before we do that, let's define what event driven is, right? The Wikipedia entry defines event driven as a software and systems architecture paradigm promoting the production, detection, consumption of, and reaction to events. So really what it means is that there are just three actions here. Oh, I added an extra slide, sorry about that. Something emits an event or a message, something consumes an event, and something reacts to an event or a message, right? That's essentially all there is to it. And it's always going in that pattern.
After the reaction, it might not emit an event; it might just simply continue consuming other events. Secondly, your events must be the source of truth. This is an extremely important aspect of event driven: you no longer want all of your services to know the entire state of the entire system. You want your one particular service to only know its own state, and it shouldn't care about the state of anything else. As a matter of fact, there shouldn't be a single service that knows the entire state of anything. The system should just be made up of single services that only know about their own state and nobody else's. That is an extremely important point. You of course want to communicate all your events through a message bus of some sort, like a broker: Kafka, RabbitMQ, MQTT, whatever it is. Another super important point is that you want everything to be completely idempotent. That sounds complicated, but it's really not. All it basically means is that your services are able to deal with situations where they might receive a duplicate event, or a couple of events that have come out of order. And the way you deal with that is you simply keep track of the events that you've taken care of already. If it's the same event coming through again, well, you just ignore it. And if it's coming out of order, well, then you apply business logic to that. Do you care that it's an older event that's appearing after a newer event has already happened? Maybe you don't, so you could just simply discard it. It's not terribly difficult to pull off, and it sounds much more impressive written down than it actually is. And ultimately, you must be okay with eventual consistency. It is just a fact that you're essentially getting really high availability and high reliability in exchange for this eventual consistency thing; you're basically exchanging one thing for another.
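The idempotency idea described above (remember what you have already handled, ignore duplicates, and decide what to do with out-of-order events) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Event` fields (`event_id`, `entity_id`, `version`) and the "discard anything older than what was already applied" policy are hypothetical choices for this example, and the in-memory sets stand in for whatever store a real service would use.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str   # unique per event; used to detect duplicates
    entity_id: str  # the thing the event is about
    version: int    # increases with every change to the entity
    payload: dict


class IdempotentConsumer:
    """Keeps only its own state, plus a record of what it has already handled."""

    def __init__(self):
        self.seen_ids = set()     # events already taken care of
        self.latest_version = {}  # entity_id -> highest version applied so far
        self.state = {}           # this service's own view of each entity

    def handle(self, event):
        if event.event_id in self.seen_ids:
            return "duplicate"    # same event delivered again: ignore it
        self.seen_ids.add(event.event_id)
        if event.version <= self.latest_version.get(event.entity_id, -1):
            return "stale"        # older event arriving late: discard it
        self.latest_version[event.entity_id] = event.version
        self.state[event.entity_id] = event.payload
        return "applied"
```

Discarding stale events is just one possible piece of business logic; a service that does care about late events would replace that branch with its own handling.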
In this case, you will no longer be able to say that you have guaranteed consistency, because you do not. You will probably have 99.99% consistency. However, there is no longer a way for you to guarantee that. You can just simply say that right now the system is mostly correct, but there's no way to say that it's 100% correct all the time. You just know that it will eventually become correct. So let's explore what makes up an event-driven system. Let's look at all the components that are involved. Number one, of course, there is an event bus. I would choose RabbitMQ for this 100% of the time. Well, let's say 95% of the time. RabbitMQ is extremely versatile when it comes to the sort of things you can do with it, rather than having to reinvent certain functionality. If you took another message broker, let's say Kafka, you would probably have to build a lot of your own stuff. For example, dead lettering: that doesn't really exist in Kafka, so you would have to build it yourself. Routing based on headers: that sort of thing doesn't exist in Kafka, but it definitely does exist in RabbitMQ. So utilizing something with such a versatile way to do routing, a system that has so many different features, is not going to pigeonhole you into designing a foundation that might be a little less than perfect. Besides that, RabbitMQ is decently fast. It's able to do upwards of around 20,000 messages a second, which is not too bad. I think it's fine, especially for something that serves as your internal event bus. Essentially, if you needed more than that, then you would probably have to go to something else.
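As a rough picture of what routing based on headers means, the sketch below simulates header matching in plain Python. It is modeled on the `x-match` argument (`all` or `any`) of RabbitMQ's headers exchange, but it is only a simulation and not RabbitMQ client code; the queue names and header keys are made up for the example.

```python
def matches(binding_args, message_headers):
    """Return True if the message headers satisfy a headers-style binding."""
    mode = binding_args.get("x-match", "all")
    # Keys starting with "x-" are treated as control arguments, not match criteria.
    wanted = {k: v for k, v in binding_args.items() if not k.startswith("x-")}
    hits = [message_headers.get(k) == v for k, v in wanted.items()]
    return any(hits) if mode == "any" else all(hits)


def route(bindings, message_headers):
    """Return the names of all queues whose binding matches the message."""
    return [queue for queue, args in bindings if matches(args, message_headers)]


# Hypothetical bindings: (queue name, binding arguments).
bindings = [
    ("billing", {"x-match": "all", "type": "order", "region": "eu"}),
    ("audit", {"x-match": "any", "type": "order", "kind": "refund"}),
]
```

Here `route(bindings, {"type": "order", "region": "us"})` selects only the `audit` queue, because the `billing` binding requires every listed header to match. With a broker that supports this natively, the matching lives in the bus rather than in hand-written service code.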
And RabbitMQ unfortunately is not distributed. You can pretty much only scale it vertically before you're going to have to go to sharding. So at that point in time you may want to look at some other tech. But for all intents and purposes, in most cases, folks that think that they need Kafka probably don't need Kafka; they can probably go with RabbitMQ. You're also going to want to have some sort of a config layer or a caching layer, and that is basically going to be a dumping ground for each one of your services to write some sort of intermediate state to, right? When we're talking about, for instance, idempotency, the services will want to record at some interval that, oh, these are the messages that I've already processed, so that in the case that the service gets restarted, it is going to be able to pick up right where it left off. For this purpose, etcd is fantastic. etcd is distributed. It is really rock solid. When I say highly latency resilient, what it really means is that you can stick nodes with 150 or 200 milliseconds of latency between links and etcd is going to survive without any issues whatsoever. It is decently fast. It says that it does about 20,000 messages a second. I have never seen an etcd do 20,000 messages a second, but let's just say that's probably not what you should be shooting for anyway. You should probably be shooting for even under 1,000 a second. And the other thing is that etcd is used heavily by Kubernetes, so the chances of it going away are pretty much next to nothing at this point. You're also going to want to have someplace to store all of your events. Everything that has ever happened on your RabbitMQ, you're going to want to put it somewhere. And for that place, there's really nothing much better than S3.
If you have the ability to use something external, meaning you don't have to run it yourself, such as MinIO or Ceph or something like that, then S3 is fantastic, because it's super cheap, it is fast, and it's plenty reliable. In some cases you might experience some hiccups trying to write to S3, but if you get around that sort of issue, everything else after that is going to be fine. And finally, you're going to want to actually fill up that event store with something. For that purpose you're going to need to build an event archiver of some sort. Building it completely from scratch, you would probably do it in Go, or you could use some sort of Spark jobs, or you could glue some things together to just move all of the events off of RabbitMQ into S3. That is essentially what comprises an event-driven system. Those are all the big components that are there. So you can kind of already tell that the big piece of it is really not the infrastructure. The infrastructure is not actually that terribly complex. It's really the organizational aspect of it, right? It's kind of a paradigm shift in how you think about things. How exactly does this translate to improved reliability? Well, number one is that you do not have to think about service outages anymore. By that I mean, if a service goes down in the midst of dealing with a request, well, it has not acknowledged that the message has been completed. So even if it dropped in whatever state it was in before it completed the work, it's going to pick the message up when it comes back up. And you do not need to write any extra code for that. That is just part of how all of the services in your stack should be operating. They pick up messages, they react to them, and upon finishing working on a message, they acknowledge it and move on. Right?
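The pick up, react, acknowledge loop described above is what makes a crash mid-message recoverable. The sketch below simulates it in Python with an in-memory queue; `InMemoryQueue`, `requeue_unacked`, and `consume` are hypothetical stand-ins written for this example, not a real broker client, and the redelivery step imitates what a broker does when a consumer disappears without acknowledging.

```python
from collections import deque


class InMemoryQueue:
    """Stand-in for a broker queue with manual acknowledgements."""

    def __init__(self, messages):
        self.ready = deque(messages)
        self.unacked = {}   # delivery tag -> message handed out but not yet acked
        self._next_tag = 0

    def get(self):
        if not self.ready:
            return None
        self._next_tag += 1
        body = self.ready.popleft()
        self.unacked[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        del self.unacked[tag]

    def requeue_unacked(self):
        # Imitates the broker redelivering messages after a consumer dies.
        for tag in sorted(self.unacked, reverse=True):
            self.ready.appendleft(self.unacked.pop(tag))


def consume(queue, handler):
    """Process messages one by one, acknowledging only after the work is done."""
    while (item := queue.get()) is not None:
        tag, body = item
        handler(body)   # may raise, which simulates a crash mid-work
        queue.ack(tag)  # only reached when the handler finished
```

If the handler raises halfway through, the message stays unacknowledged, gets requeued, and is processed again on the next run. Because a message can therefore be seen more than once, this pattern goes hand in hand with idempotent handlers.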
So basically, yeah, service outages are a significantly smaller thing. Next is the fact that you are no longer relying on any other services around you. You are the master of your domain. There is a single service that only cares about itself. That means that your failure domain is really small, and you can put all your efforts into making sure that that failure domain is actually solid. Realistically, what we're talking about is one service and the two or three dependencies that it has, which are your event bus and etcd and whatever third or fourth dependency you have. So it is definitely not just a service by itself. But when you're thinking through your failure domain, you no longer have to think about, well, what happens when service A or B or C goes down? And what if D becomes slow and so on? Because they're not really part of your failure domain anymore. True service autonomy: that is one thing that was promised to all of us when the microservice pattern emerged as a clear winner, that, oh, everybody gets to work on their own stuff now. Well, the fact of the matter is that that's not entirely true, because even though you own the code for your service, you are still highly dependent on this other team that owns service B, which you have a dependency on. If they change their API, well, now you have to update all your code as well, and you're going to be delayed. So what we're talking about here is that, again, because we do not care about anybody else but ourselves, we are truly becoming autonomous. Next, a well-defined development workflow. What this is referring to is the fact that you are now going to have some sort of centralized schema which is going to represent everything that can happen on your system. Basically, what we're talking about is a single message envelope that is going to contain actions, parameters, all kinds of stuff inside of it.
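A single message envelope could look roughly like the Python sketch below. The talk points to Protobuf (or Avro) schemas kept in one shared repo; JSON and a dataclass are used here only to keep the example dependency-free, and every field name (`action`, `parameters`, `source`, `event_id`, `schema_version`) is an assumption made for illustration rather than Batch's actual schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Envelope:
    """One shared wrapper that every event on the bus travels in."""

    action: str       # what happened, e.g. "user.created" (hypothetical name)
    parameters: dict  # action-specific data
    source: str       # name of the emitting service
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    schema_version: int = 1

    def encode(self) -> bytes:
        # Serialize to bytes suitable for publishing on a message bus.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def decode(cls, raw: bytes) -> "Envelope":
        # Rebuild the envelope from bytes received off the bus.
        return cls(**json.loads(raw.decode("utf-8")))
```

A producer would call `Envelope(action=..., parameters=..., source=...).encode()` and every consumer would call `Envelope.decode(...)`, so the envelope definition becomes the one interface that all services share.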
And as a result of doing that, you're centralizing on one way to communicate the interfaces for your services. That means that you no longer have situations where one service is using Swagger and another service is using, I don't know, Insomnia or Postman or something like that. There is now one repo which has all your schemas, which say, this is the sort of stuff that I expect to be in this message, and that's the end of it. And that is super, super wonderful. There is really an entire other talk that we could do just on Protobuf schemas alone, so we're going to leave it at that. But the point is, Protobuf is not even really that hard to begin with, and neither is Avro. None of those message encoding formats are really that complex to begin with. And then finally, we have dramatically lowered our attack surface. The fact that we no longer need to talk between services means that we are able to implement a blanket rule set on our firewall to simply say, no, we no longer need to accept any inbound connections, and the only outbound connections that we allow are to the event bus and to etcd or whatever else. So you no longer have to punch firewall holes all over the place just to allow one service to talk to another one. And that is amazing, absolutely amazing. Security is going to be super happy about it. One thing that's really important here is that folks probably want to see what it actually looks like to do something completely event driven. Batch is 100% event driven, and we have about 19 or 20 services or so. Every single one of them is based off of a single Go template that we constantly keep up to date. You can go and check it out; it is right there, it is public. Just have at it. We use Kafka and RabbitMQ. RabbitMQ is our system bus, basically the bus where we communicate state and so on.
And Kafka is used basically for high-throughput stuff. Yeah, feel free to take a look at that. Now, you're probably thinking that this all sounds terribly complicated. And yeah, it's pretty complicated, it turns out. From a technical perspective, it's actually not that complicated, depending on the expertise of your engineering team. The complicated part is really, I guess, the political aspect of it: trying to communicate all these changes and so on across engineering. That part is really complicated. I also think it's really important for you to understand your message bus inside and out. There is nothing worse than designing a foundation that you think is really beautiful and realizing some months later that this feature that you handcrafted into your foundation was actually something that was completely supported in the message bus itself. And I say that only because I have totally done that myself several times, only to realize that I really should have just sat down and gone through the docs to know that, oh, wait a second, I can just do this automatically. The bus can take care of this for me, and I don't need to come up with some sort of a solution. Another big part is that you absolutely need to accept that the events really are the source of truth. That is super important. It is really one of the main points of the event-driven and event sourcing architectures. You should also embrace eventual consistency, and the same goes for idempotency. And you should anticipate complex debugging. Where debugging was fairly straightforward-ish if you had request tracing with HTTP, now debugging has gotten quite a bit more difficult. This space is still kind of greenfield. There aren't a whole lot of tools, so you should probably expect to have to build some of this stuff yourself.
I figured it would probably be helpful to put down how much time certain parts of this are going to take, at least from the technical perspective. I have a couple more slides at the very end of the presentation which go into how much time it takes for the organizational aspect as well, but I'll leave that for later. So first things first: setting up the actual foundational infrastructure. I think it's the easiest part by far. It really shouldn't take even one week, maybe two weeks max or something like that, especially if you're using some third-party things from a vendor, such as S3, right? Or maybe for the event bus you're using some sort of a platform as a service such as Compose.io or something like that. Defining the schemas might take a little bit longer. It also varies highly based on how much expertise you have in regards to how your product works. Do you understand every single part of it, or do you understand only a portion of it? The latter means that you're going to have to talk to more people to make sure that it fits correctly. And then when I'm talking about schema publishing and consumption, all that I'm really talking about is CI for creating releases of your schemas. It can be anywhere from medium to hard. It entirely depends on how complicated your schemas are. Are you using Protobuf? What are you actually using for the message encoding? All that sort of stuff. A really important thing is to provide an example service that uses event driven. You do not want to give your developers just a mandate that they should be doing event driven. You want to provide something like libraries, probably wrappers, that sort of thing, to say, this is how you do it, and you just plug and play, essentially, and you get everything. And then the last parts, which have an asterisk next to them, are by far the hardest parts of all of this.
With that said, they are also potentially not needed right away. You'll probably want them eventually, but they're probably not needed right now. The most important one by far is having some sort of an event archiving solution. That is basically something that is going to consume all the events from RabbitMQ or Kafka and stick them into long-term storage, like S3. There is a little bit of complexity in that, in the fact that you probably need to group the events. You need to put them together, because you don't want to have 1 million files in one directory, or whatever, one object space in S3. So you're going to want to group them, munge them, and compress them, that sort of thing. But as for the rest of them, such as a replay mechanism or an event viewer, maybe you don't need them right away, but you will probably need them eventually. Now some quick tips in regards to when to stand this up and what the best approach to all this is. Number one: if you have a brand-new platform, everything is new, and you know what you're doing, this is awesome. That is the prime time. It is absolutely amazing to implement an event-driven platform. It is fantastic. But you should really still only do it if you have a complete understanding of everything: how everything works, and how all the services are going to interact. You need to have a very good idea of all the flows that are going to be happening in your system. Really, really. This kind of goes without saying, but I'm going to point it out anyway: this is largely based on your engineering capability. If you have experience in this and you have experience dealing with architecture and design, you could probably pull it off.
If you do not, then you might want to have some friends around who are going to be able to put on their architecture hats together with you and think through all of this. With that said, you almost certainly want to do this with somebody else, even if they're less experienced, just to make sure that you're not doing something totally silly and egregious. One final thing: do not use CDC as your primary source of truth or as your only source of truth. You can totally use CDC, or change data capture, but use it as a helper, not as the primary way to create events. Let your services create the events, not your database. In most cases, though, you're probably going to have an existing org, and in that case, definitely move to event driven gradually. Do not try to do it in one fell swoop. It never works. I have never seen it work. It is going to be a massive waste of time and there are going to be problems. Functionality will be missed. The timetables that you assign to it are going to be exceeded dramatically. Just do it gradually, little by little. You can utilize CDC for this, where you are going to expose some database little by little. For every single update that is happening in the DB, you're going to push it out as an event and have some services rely only on that. The only caveat there is that you do not want to have half of a service rely on CDC or directly on the database, and the other half of the same service rely on events. Either it relies on one or it relies on the other. Now, in regards to some more reality: where does SRE fit into all of this? I think that SRE in this particular case is by far the most important part of the conversation. To a certain degree, they need to be involved in anything that deals with distributed system design. The only folks that truly understand how things work at a platform level are SREs.
So if you are not involved in the conversation, you should be. And if you are involved but you're not a lead, you should be a lead. You should be leading the charge on all this sort of stuff. Another thing you should know is that event driven in general is a totally greenfield area. Granted, you might see some stuff about React or something like that, which is event driven as well. But when it comes to event driven for systems, there's not a whole lot out there. It's part of the reason I wanted to start a company in this space: there's not a lot out there, and I really wanted to build something that addresses some of these issues. Another thing is that you need to get comfortable wearing an architecture hat. You will wear it no matter what. If you are being a thought leader, providing documentation and talks and so on, then like it or not, you will have an architecture hat. It's simply going to happen. The final bit to all these tips is that I would focus heavily on documentation and on written culture in general: everything should be written down. A lot of us are completely remote now anyway, so it makes even more sense than before, but you absolutely must have a written culture in place with an event driven system. A lot of stuff feels like magic when it just works. But when it doesn't work, you will absolutely wish that you had flow diagrams, runbooks for how things are supposed to work, maybe proposals for how a certain part of the architecture is going to work, and so on. So written culture is super important. In essence, in exchange for a pretty decent amount of complexity, you are going to gain a lot. You are going to gain real autonomy for both services and teams. 
You will be able to rebuild state. You will have predictable failure scenarios. Your recovery time should improve massively because you will no longer have to hunt down various teams and so on. You will be able to sustain really long-lasting outages. Not that long-lasting outages are a terribly great thing to have, but you will be able to sustain them. Security will totally thank you, because you have just added a massive improvement across the entire platform and you have a really solid, well-defined, and very robust foundation. It's not going to take much explaining to say how it works, because there's basically only one thing here: you emit an event, and some number of services are going to do something about that event and update certain parts of it. That's it. And the final part is that you will have a lifetime of historical records. That is super amazing, because you now have everything you need for any sort of certification. You essentially have an audit trail, and on top of that you have analytics for your future data science teams and so on. Just as a quick side note, and I think I already alluded to this: Batch is 100% event driven. So I'll give you some quick stats on what that means for us. Number one, we're a complete AWS shop. We are using EKS, we're using MSK, which is the managed Kafka, and we're using lots of EC2, plus a random assortment of their other services: the basics like Route 53 for DNS, and RDS in some cases. We have a total of, I think, somewhere around 19 or 20, maybe a few more, Golang services. All of them are based on that Go template that I mentioned. We are 100% event driven, except of course for the front end, which has to talk to a public API. Those are synchronous requests. But we have zero inter-service dependencies. 
Most services have only three dependencies: Rabbit, etcd, and Kafka. That is it. That means we have an extremely locked-down network. I'm skipping a few points there, but that is the result. We do not have a service mesh and we have no service discovery. We do not need any of it, because we don't really care about accessing those services, and those services don't need to talk to each other. We don't care about having mTLS between every single service and so on. The only thing we have is mTLS between the service itself and Kafka, and, you know, Rabbit and etcd, and that's about it. What this means, of course, is that instead of triggering behavior via curl or Postman, we trigger behavior by emitting an event onto the event bus. For that we use a tool called plumber that we developed. It's also open source; you can see it under the batchcorp GitHub org. It's a slightly different paradigm, right? Instead of doing curl for an HTTP request, we now do plumber, but it's essentially the same concept. All we're emitting is a message, a protobuf message. I already mentioned that our network is massively locked down. The services are only able to talk to very specific other IPs on the network. As for the stats: our average event size is about 4 KB. We have about 15 million system events, and that takes up about 100 gigs of storage in S3, which translates to a couple of bucks or so a month for S3. In the grand scheme of things, it's absolutely nothing. So while it may seem that you are super amazing at event driven now, you're probably not quite there yet. I'm not there yet either, but I have a decent-ish idea. To get there, you should probably spend a little time doing more research and reading on this topic. 
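As a rough check on the "couple of bucks a month" figure for about 100 gigs in S3, the arithmetic can be sketched like this. The per-GB price is an assumption (roughly the S3 Standard list price in some US regions; it varies by region and storage tier), and request and transfer charges are ignored.

```python
# Assumed price: about $0.023 per GB-month for S3 Standard (varies by region and tier).
PRICE_PER_GB_MONTH = 0.023


def monthly_storage_cost(stored_gb: float, price_per_gb: float = PRICE_PER_GB_MONTH) -> float:
    """Storage-only cost per month; requests and data transfer are not included."""
    return stored_gb * price_per_gb


cost = monthly_storage_cost(100)  # the ~100 GB quoted in the talk
print(f"~${cost:.2f} per month")  # prints "~$2.30 per month"
```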
And you should totally start off with Martin Fowler's article on event driven. It basically talks about the fact that event driven actually covers multiple different architectures, so you should be aware of what you're talking about when you say event driven in the first place. It's just a good overview of event driven in general. Then there's the event sourcing architecture pattern, which is basically what we discussed here: when you're using replays, that's essentially what you're implementing. It's all under the same event driven umbrella. CQRS, same thing: it's another architecture pattern that fits really well within the event driven umbrella. Now, in regards to idempotency, everyone should probably read about what an idempotent consumer is. The microservices.io site explains what idempotency actually is in very few sentences, much better than I could. There is also one article that I found, written by Ben Morris, that is super awesome. There is one particular quote that I really liked: bear in mind that if you rely on message ordering, then you are effectively coupling your applications together in a temporal, or time-based, sense. If you read between the lines, what it's really saying is that you probably don't want to do it. You're creating potential problems for yourself. In other words, do not rely on ordering. Create your system so that order doesn't really matter. That's where idempotency comes into play. By the same token, exactly-once delivery is another topic that can be discussed in detail for a very long time. But exactly-once delivery is very difficult and potentially snake oil. There, I just said it. Yes, Kafka does talk about it, and yes, technically it is doable, but there are many, many caveats to it. 
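To make the idempotent consumer idea concrete, here is a minimal sketch, not taken from any of the articles mentioned. It assumes every event carries a unique `id` field; the class name, the handler, and the in-memory set of seen IDs are all hypothetical, and a real service would keep processed IDs in durable storage.

```python
class IdempotentConsumer:
    """Wraps a handler so that redelivered events are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory only; use a durable store in practice

    def consume(self, event) -> bool:
        """Return True if the event was handled, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False  # already processed, safe to drop
        self._handler(event)
        self._seen.add(event_id)  # record only after the handler succeeds
        return True


balances = {}


def apply_deposit(event):
    # Not idempotent on its own: applying it twice would double the deposit.
    balances[event["account"]] = balances.get(event["account"], 0) + event["amount"]


consumer = IdempotentConsumer(apply_deposit)
deposit = {"id": "evt-1", "account": "acct-1", "amount": 50}

first = consumer.consume(deposit)   # handled
second = consumer.consume(deposit)  # duplicate delivery, ignored
```

After both deliveries the balance for `acct-1` is still 50, so a redelivered message does no harm.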
So it's simply much better to design your system so that your services are able to deal with duplicate events in the first place. Okay, that is basically it. I hope I was able to show you something new. If this is the sort of stuff that is interesting to you, you should totally come and talk to me. If you want to work with stuff like this, you should also come and talk to me. This whole space is super interesting and still fairly fresh. There aren't a lot of people doing it. In general, high-throughput observability is not really a thing yet, and we are trying to make it happen. So if you are thinking about going event driven, or if you have evaluated event driven and you're not interested in it or you're afraid of it, come talk to me. I would love to nerd out about this sort of stuff in any case. All right, that is all I have. Thank you very much and goodbye.


Conf42 Site Reliability Engineering 2021 - Online

September 30 2021


Daniel Selans

Co-Founder @ Batch
