Conf42 Cloud Native 2022 - Online

Event Streaming and Processing with Apache Pulsar


Abstract

When it comes to distributed, event-driven messaging systems, we usually see them supporting one of two types of semantics, streaming or queueing, and rarely do we find a platform that supports both.

In this presentation, we’ll first get an introduction and some clarifications of event-driven versus message-driven systems, event streams, and stream processing.

We’ll then take a look at Apache Pulsar, which offers a unique capability for modern, cloud-native applications and architectures: its platform supports both pub-sub and message queues, and it extends into stream processing as well as message mediation and transformation. We will look at how it relies on Apache BookKeeper for durable, scalable, and performant storage of log streams, and how it leverages Apache ZooKeeper. We will also see how Pulsar is meant to bring together the best of other systems, for example how it fills the gaps that Kafka has and extends its streaming capability in the complex cloud world.

Summary

  • Mary Grygleski is a senior developer advocate at DataStax on the streaming team. Her specialty is in Java, cloud native, data, and DevOps, and she is now moving further into streaming. Here are several ways you can stay in touch with her.
  • Apache Pulsar is meant to extend what we have today. The cloud-native world is all about asynchronous processing, concurrency, and scalability, which makes the problem even more challenging than before. This is where Pulsar can come in and augment the situation.
  • An event is essentially an occurrence that happens in time. Events are more attuned to how human beings behave than traditional computing is. Real-time communication isn't new, but think about what it would be like if we did it synchronously.
  • Event stream processing involves taking action on a series of data points. CEP is a set of techniques for capturing and analyzing streams of data as they arrive to identify opportunities or threats in real time. We want events to help with the human condition and make things happen a lot faster.
  • And now, event-driven versus message-driven messaging. Think of event-driven as what we call publish/subscribe or producer/consumer systems. As systems become more sophisticated, we're seeing more and more event-driven systems come into play.
  • Streaming really refers to the pub/sub, publish/subscribe paradigm. Topics are essentially labels that group messages together; those who are interested have to subscribe to a topic to receive its messages. Message queueing is a form of asynchronous service-to-service communication.
  • Event streaming allows scalability and elasticity to happen in a much more flexible way. It helps meet the demands of the large volumes of data generated by applications operating at the edge and by IoT systems. We use real-time data to enhance customer experience and create a competitive advantage for your business.
  • The real-time aspect is the core of why we want to do event streaming. We want to be able to ingest messages at high frequency but with very low latency, so there's not much of a gap between when you receive your messages and when you process and output them.
  • Apache Pulsar is a unified messaging and streaming platform. It's cluster based and multi-tenant, and the client API surprisingly isn't overly complicated. It supports Java, C#, Python, Go, and JavaScript, among others. It became a top-level Apache project in 2018.
  • Apache has a project called BookKeeper. Essentially it does electronic ledgers and journals, and Pulsar leverages BookKeeper for logging and storing all of these things. Pulsar has adopted a tiered architecture design, and this distributed architecture supports horizontal scaling really well.
  • BookKeeper uses a protocol that helps with writing all of these log records. Fast writes are guaranteed through its journals and ledgers. It is a library and a project of its own under Apache. I won't go into all the deep details just yet.
  • Apache Pulsar represents the next generation of enterprise messaging. It is a unified solution for pub/sub, streaming, messaging, queueing, and message mediation and enrichment. It can run on-prem or hybrid, with geo-replication. There are still a lot of things being planned for it.
  • Pulsar provides lower latency and higher throughput, a flexible message processing model, and multi-tenancy, and it's a good fit for big use cases as well. Why is it better than Kafka?
  • At DataStax we offer Astra Streaming and Luna Streaming. Luna Streaming is essentially free unless you want to purchase enterprise support. Please follow my Twitch stream every Wednesday at 2:00 p.m. US Central time, and follow me on Twitter and LinkedIn.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to my talk. My topic today is event streaming and processing with Apache Pulsar, and I'm very happy to be back at Conf42 Cloud Native 2022. This is my second year joining this conference, so thank you to all of the organizers, especially to Mark for your help. My name is Mary Grygleski. I'm a streaming developer advocate at DataStax, and here's my Twitter handle if you'd like to get a hold of me. So who is Mary? I'm a senior developer advocate at DataStax with the streaming team. DataStax is a company that specializes in data management, in NoSQL and cloud-native databases, and in a database-as-a-service managed platform. We specialize in big data; we can handle data in really huge volumes. So that's about me. Before this I was at IBM for about three and a half years with the Java and then WebSphere teams. My specialty is in Java, cloud native, data, and DevOps, and I'm now moving more into streaming; I also worked on some reactive stuff in the past. I'm also a very active community builder: I'm the president of the Chicago Java Users Group as well as the co-organizer for several IBM-sponsored meetups in the Chicago area. Before this, I was a developer for over 25 years in various capacities. Primarily I was really heads-down doing engineering and development work, and then I moved more into technical architecture. I've been involved, so to speak, in IT departments, working in the trenches, doing design, integration, testing, development, deployment, and all kinds of stuff, and when it came to DevOps, I got my hands dirty in that too. Here are several ways you can stay in touch with me: my Twitter handle, my LinkedIn profile, and I also have a Twitch stream as well as a Discord channel, but I will share those with you towards the end of this presentation. So here's today's agenda on event streaming and event processing using Apache Pulsar. First, let's start out with some fundamentals: what is event streaming, what is event processing, what is complex event processing, and then event-driven versus message-driven messaging. There are some subtle differences in there. Very often today we hear about event streaming with all of these things bundled up together. It's all about asynchronous processing, concurrency, and scalability, especially in the cloud-native world, which makes the problem even more challenging than before. Okay, so after this introduction, I'll get into event processing semantics: pub/sub and queueing, which are the typical semantics. Then: why event streaming? Why do we want to do this? After that I'll give you a bit of an intro to Apache Pulsar, and since Apache Pulsar depends very heavily on Apache BookKeeper as well as Apache ZooKeeper, I'll give you some background on those. And essentially, in my opinion, Apache Pulsar is really meant to extend what we have today. For example, Kafka does its job perfectly well, but there are certain limitations, and this is where Pulsar can come in and augment the situation, make it even better. So with this, let me start this talk. Okay, the many facets of computing events. First of all, what is an event? And bear with me if you are coming in here already knowing some of this, but I have to assume that not everybody has this background, so I want to go over some basics.
So what is an event, generically speaking? I looked up the dictionary at merriam-webster.com right away. First of all, something common to any kind of scenario involving events: they are really about something that just happens, an occurrence. Now, as I paged through that list of interpretations, number four really caught my attention. It's essentially talking about events being the fundamental entity of observed physical reality, represented by a designated point. So what is this point? This point is designated by three coordinates of place, the x, y, z coordinates, and then also one of time. So it sits in the spacetime continuum, as postulated by the theory of relativity, and that's essentially how we look at events in computing. An event is essentially an occurrence that happens in time and is represented by a point: the three coordinates of place, x, y, z, and then one of time. I feel it's really an exciting topic. It gets more into the abstract, but also, if you think about it, events are actually more attuned to how human beings behave than traditional computing is. When we were all first learning about computers, we looked at things very mechanically, so to speak. We define data, we flatten out whatever it is, or maybe flatten out some structure; we assign properties to objects and then try to make sense of them and do the computing. It's actually a bit awkward if you think about it. All of the processing we used to do would be synchronous, meaning blocking. If you issue a request, for example over HTTP, even in web interaction, HTTP is a stateless protocol, so you issue a request and the request essentially has to wait: you send a request over and you have to wait for the response to come back before you can proceed. It's a very blocking, very time-consuming type of protocol, and it doesn't really mimic what human beings do. Do we actually wait like that, even when talking on the phone? And as you can see from talking on the phone, real-time communication actually isn't new at all; our telephone systems are essentially real-time communication. But think about doing it synchronously: wouldn't it seem kind of awkward? It's probably like the old walkie-talkie, where only one person can talk at a time, and when you're done, you say "over," and then the sender and receiver, both persons in the conversation, have to take turns to speak. It's a bit less natural. But again, this is events: we want events to essentially help with the human condition and make things happen a lot faster. So, first of all, what is event stream processing? It is really the practice of taking action on a series of data points that originate from a system that continuously creates data. I know it's a mouthful, but an event essentially refers to each data point in the system, and the stream then refers to the ongoing delivery of those events. If you represent a stream as a straight line, it can be many events, represented by dots, that happen over time. This is essentially how you can represent events, and a series of events can also be referred to as streaming data or a data stream. How about complex event processing?
Complex event processing, CEP, is a set of techniques for capturing and analyzing streams of data as they arrive to identify opportunities or threats in real time. We find that, for example, some kinds of security scanning and monitoring really involve complex event processing: if something happens, it triggers some alarm, that type of scenario. CEP enables systems and applications to respond to events, to trends, and to patterns in the data as they happen. That's complex event processing; I think we deal with it every single day in our lives, and we want to do the same in computing. Okay. And now, event-driven versus message-driven messaging. For event-driven, the interpretation I use is actually from Lightbend, the company that makes Akka, the reactive framework for reactive systems that abide by the Reactive Manifesto. What it says is: the sender emits messages, and interested subscribers can subscribe to those messages. That's event-driven. Think of it more like what we call publish/subscribe or producer/consumer types of systems: the sender basically sends the messages without having to worry about who is going to receive them, and it's up to the subscribers who are interested to subscribe to the messages. That's how it works. Now, message-driven is not the same. It's basically sender and receiver: they have to be known to each other beforehand, so the address needs to be known. That's message-driven messaging, and usually message-driven is really about two parties that are communicating with each other, so they know who each other is. As you can see, those are the differences between event-driven and message-driven messaging. And now, what happens before taking an event approach? Think of today's IT systems: in IT departments, there's still a lot of batch processing happening. Essentially, we collect all of the data, group it together, batch it up, and then have the processing run, perhaps in the middle of the night when nothing else is happening, so you don't take away processing capacity in your system while crunching through all of these heavy data sets. That's what happened, especially in the 90s and the early 2000s: a lot more batch processing. But as systems become more sophisticated, we're seeing more and more event-driven types of systems come into play. Now don't get me wrong, batch processing still has its place and it's still very relevant, especially if you are in an IT department dealing with systems of record, where you're processing something that can be done in batch. I believe a lot of machine learning processing is also done in batch, because the data set is just too big and it's hard to process it all in real time. But this is the reason why we want to develop systems and capabilities such as Apache Pulsar, which really targets big data sets. Now let's take a look into event messaging, some semantics and patterns. So, streaming. This is what we are all about, and my job is essentially about advocating for streaming.
When we talk about streaming, we're really referring to the pub/sub, publish/subscribe type of paradigm. The publishing client sends the data, and in a pub/sub system there's always a middle person, an agent in the middle called a broker. The broker is the one that mediates everything: it receives the messages from the publisher, and then, based on different types of configuration parameters, the broker delivers the messages to those who are interested in them. It requires a subscriber to subscribe: the subscribing client will receive the data, but it has to initiate and subscribe to the topics. Very often in a pub/sub system we're talking about using topics: Kafka uses topics, Pulsar has them, and even IoT messaging such as MQTT is a pub/sub system that has topics. Topics are essentially just labels so that you can group all of this messaging together, and those who are interested have to subscribe to that topic to receive it. Message queueing: so what is message queueing? Message queues are a form of asynchronous service-to-service communication used in serverless and microservices architectures. Messages are stored in the queue until they are processed and deleted, so essentially a first-in, first-out principle applies, and each message is intended for a single consumer and processed only once. Message queues can be used to decouple heavyweight processing, to buffer or batch work, and to smooth out any kind of spiky workload. As you can see, queueing helps in offloading some of this heavier processing. That's what message queueing is. So, about event streaming: why event streaming at this point in time? Earlier I talked about event streaming mimicking more of human behavior; it's a lot more natural to us, but it's also much harder to do, because now we are bringing in concurrency and bringing in events that happen at different times. An event hasn't happened yet, and you have a subscriber waiting for it. It involves a lot more computing, or rather managing the computing, at the back end, so it's more complicated, but you reap the rewards, the benefits of it being real time. So what's driving this change? We use real-time data to enhance customer experience and create a competitive advantage for your business; on a high level, that's what it is, and marketing people like to talk about advocating for real time. Essentially, things happen right away. If you are at a football game, you want to know the statistics of a player right away; that data can be streamed immediately, analyzed, and output for you, and within seconds you can have it. That's some of the advantage. Then there are also data pipelines, which are very useful for building AI and machine learning smart models from time-series event streams, for example; data pipelines are another good usage of event streaming, or rather, event streaming can help with building data pipelines. Another important aspect of event streaming is that it allows for scalability and elasticity to happen in a much more flexible way.
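To make the pub/sub and queueing semantics described above concrete, here is a minimal sketch using the Pulsar Java client, assuming a broker running locally at pulsar://localhost:6650; the topic and subscription names are illustrative, not taken from the talk. The producer publishes without knowing who will consume, and a Shared subscription gives the queue-like, process-once-per-message behavior.

    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.Message;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.SubscriptionType;

    public class PubSubQuickstart {
        public static void main(String[] args) throws Exception {
            // Connect to a Pulsar broker assumed to be running locally.
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            // Subscriber: attaches to the topic first; a Shared subscription gives
            // queue semantics (each message goes to one consumer and is acked once).
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://public/default/demo-topic")
                    .subscriptionName("demo-subscription")
                    .subscriptionType(SubscriptionType.Shared)
                    .subscribe();

            // Publisher: sends to the topic without knowing who, if anyone, consumes it.
            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://public/default/demo-topic")
                    .create();
            producer.send("hello, streaming world".getBytes());

            // The broker mediates: it delivers the stored message to the subscriber.
            Message<byte[]> msg = consumer.receive();
            System.out.println("Received: " + new String(msg.getData()));
            consumer.acknowledge(msg);

            producer.close();
            consumer.close();
            client.close();
        }
    }

Switching the subscription type is all it takes to move between streaming-style and queue-style consumption on the same topic.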
Event streaming helps meet the demands of the large volumes of data generated by applications operating at the edge and by IoT systems. For example, with a pub/sub broker you can handle enormous volumes, maybe petabytes of data; well, I may be exaggerating, but there have actually been benchmarks done with something like 100 billion messages published to topics. It's really the scalability aspect that it helps with, especially in today's cloud-native world. Now then, why event streaming? We want to watch for events within a system or application, we want to subscribe to topics to see certain event types, and we want to make decisions on data in real time, not after the event: you want to see the data, see it being transformed, and have it churn out something useful for you. The real-time aspect is the core of why we want to do event streaming. We want to be able to ingest messages at high frequency but with very low latency, so there isn't much of a gap between when you receive your messages and when you're able to process them and output them in some format. Essentially, it's the real-time aspect. And now comes Apache Pulsar. So what is Apache Pulsar? It is open source, governed by the Apache Software Foundation as a proper top-level project. It was first developed at Yahoo, back in roughly the 2014-2015 time frame; Yahoo contributed it to the Apache Software Foundation in 2016, and then in 2018, less than two years later, it became a top-level project. The reason is that it has built in a lot of the capability needed for cloud native that is really lacking in most of the other systems, in fact all of the other systems, and that's where it stands out. It is, again, a cloud-native design: it's cluster based and multi-tenant, and the client API surprisingly isn't overly complicated. It supports Java, C#, Python, Go, and JavaScript, among others; Scala too, for example, if you want JVM languages, and there are other ones as well, so you can check the website. The key is that Pulsar separates out compute and storage, and that's what makes it a lot easier from the scalability perspective. Another thing: Pulsar being a pub/sub broker, it has a guaranteed message delivery mechanism built in, so if a message successfully reaches a Pulsar broker, it will be delivered to its intended target. Depending on the message-level quality of service, you can have QoS level zero, which is fire and forget, in which case the broker isn't held accountable for guaranteeing delivery; QoS one, which is at-least-once delivery; or QoS two, which is exactly once, which is a lot more precise. All of these are supported by Pulsar. Pulsar also has a very lightweight serverless functions framework called Pulsar Functions, which lets you create complex processing logic within a Pulsar cluster.
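As a rough illustration of the kind of logic a Pulsar Function can hold, here is a minimal sketch using the Java org.apache.pulsar.functions.api.Function interface; the class name and the transformation it performs are hypothetical, not something shown in the talk.

    import org.apache.pulsar.functions.api.Context;
    import org.apache.pulsar.functions.api.Function;

    // A tiny transformation step: Pulsar feeds it each record from the input topic,
    // and whatever it returns is written to the configured output topic.
    public class NormalizeFunction implements Function<String, String> {
        @Override
        public String process(String input, Context context) {
            context.getLogger().info("processing one record");
            return input == null ? null : input.trim().toLowerCase();
        }
    }

You would typically package a class like this into a jar and deploy it to the cluster with the pulsar-admin functions tooling, wiring an input topic to an output topic.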
For example, if you're building a data pipeline, you want to fetch the data and do some work of transforming or mediating it, whatever you need to do, and you can leverage Pulsar Functions to help you transform and mediate all of these messages. It comes in very handy: there's already a command line that helps you build your function, deploy it, and eventually get it running, or you can do it through a console. For example, I work for DataStax now, so we also have Astra Streaming, the commercial managed streaming offering; Astra DB is the commercial managed Cassandra database, and Astra Streaming adds the streaming that Apache Pulsar enables alongside Cassandra, so you really enable more of the real-time processing aspect, and you can use Pulsar Functions there to do data transformation as well. That's just an example. Then there are also the tiered storage offloads. What that means is that Pulsar offloads data from hot and warm storage to cold, long-term storage when the data is aging out. When it's fresh, you may still keep it in hot or warm storage, or in memory; and that's also where BookKeeper comes in, which I wanted to bring up. Pulsar has built-in capability to recognize that if some data is aging out and not being used as much, it may as well be written to persistent, long-term, cold storage. So Pulsar has tiered storage offloads built in. So again, what is Pulsar? Let's give a bit more detail. It's a unified, distributed messaging and streaming platform. What does that mean? It supports messaging, and not only pub/sub but also queueing, and not only the messaging itself: it does transformation and mediation through Pulsar Functions and things of that nature. It's a truly distributed messaging platform that's suited for today's cloud-native world. It's open source; it came out of Yahoo and is now part of the Apache Software Foundation, and in fact it's one of the fastest growing projects, with more and more committers, as you can see on these graphs. GitHub stars have increased since 2018, when it became a top-level project, with a big jump around June 2021, and so have the number of contributors and monthly active contributors; in fact, in June of 2021 the monthly active contributors overtook those of Kafka. So as you can see, it's gaining popularity. And again, it's cloud-native ready and Kubernetes ready (there's also a companion project that is Kubernetes for Cassandra), and Pulsar is basically ready to work with Kubernetes and with Cassandra too. It supports multi-cloud and hybrid cloud, and in fact I will shortly share a link to how you can quickly test out Pulsar through Astra Streaming. So, four reasons why Pulsar is essential to the modern data stack; we'll look at that after I show you a bit more information. Who else is using Pulsar? Look at these companies: Comcast, Yahoo, Overstock, Splunk, General Motors, Iterable, Cargill, Verizon, Tencent, Shopin, Nutanix, all of these.
So these are major players, pretty major players. And sorry for the somewhat marketing-oriented slides; I know I'm talking to developers, but I hope you appreciate them too. The fact is that there have been comparisons showing it really lowers the three-year cost compared to Kafka because of its capabilities: it's higher performing, with savings for high-complexity scenarios and savings for higher data-volume scenarios. And again, a brief history of Apache Pulsar, but I've essentially already talked about it, so I won't go through all of it again: it's a cloud-native, distributed, unified messaging and streaming platform that has been an open source Apache top-level project since October of 2018, and the trend keeps growing. As you can also see, it had the Log4j 2 issue fixed back in December; as soon as that came up, the community quickly delivered a patch. So now let's go a little more into Pulsar. It is different; why is it different? Here it is. I just want to point out that there's a producer, a client application sending messages to a topic managed by the broker. It's up to the consumers, the ones that are interested in consuming the messages, to subscribe to the topic where they know the producer is going to send the messages. So consumers subscribe to the topics, and then there's the broker. The broker is essentially a stateless process that handles incoming messages and message dispatching, communicates with the Pulsar configuration store, and stores messages in BookKeeper instances. So the broker itself actually interacts with BookKeeper. And what is BookKeeper? Pulsar is unique in the sense that it doesn't want its compute side to be worried about doing bookkeeping. As human beings we do accounting; do we like accounting? Most of us don't, because it's crunching through a lot of numbers. It's tedious work, but it is of utmost importance. So Apache has a project called BookKeeper. Essentially it does electronic ledgers and journals and all of these things; if you are an accounting enthusiast, or you're familiar with it, then you're familiar with those terms. Essentially it's bookkeeping gone electronic, gone digital, and that's what Pulsar leverages for logging and storing all of these things. As you can see, a bookie itself has different segments in its storage for better organization and management and faster retrieval. So the broker is the one doing the serving, but it also interacts with BookKeeper. Now there's also ZooKeeper; I bet some of you, or most of you, probably already know of ZooKeeper. It manages the cluster: the metadata, and the coordination tasks between different Pulsar clusters. As the name suggests, ZooKeeper is there to keep order in the zoo; as you can imagine, this whole cloud-native thing can become very confusing, so ZooKeeper comes in, manages all of this metadata, and handles all the coordination, making sure nobody steps on each other's toes, essentially. So those are the Pulsar components. I just wanted to highlight the design principle of Pulsar: it has adopted a tiered architecture design approach.
It's a multi-node architecture. As you can see, we already looked at the producer and consumer: they interact with the broker; the producer sends messages to the broker, and a consumer that's interested subscribes to the topic. Actually, I should say the producer sends messages to the topic, and it's the broker that manages it; the broker also has to acknowledge back at the network layer. And as you can see, topics can be partitioned, so a broker has different topics being partitioned off, and this distributed architecture supports horizontal scaling really well. Partitioned topics, as an abstraction, mask complexity for consumers: a consumer doesn't have to worry about it, because it's the broker that manages how these topics are partitioned. So what are some common challenges? Traditional scaling requires partitioning, rebalancing, and all of these things, and having Pulsar separate the compute and the storage makes scalability and rebalancing much cleaner than it would be if you had a component that combined both compute and logging. If you need to rebalance in that combined case, how do you do it? It's quite a messy situation. But if you separate the concerns, each takes care of things independently, yet they are all tied together through the broker. Tightly coupling persistence and message-serving capabilities imposes high costs on historical data, and the trade-offs made to support partitioned topics came at the expense of the messaging semantics needed for use cases such as queueing. So as you can see, if you don't separate these concerns, the messaging semantics have to take care of use cases such as queueing. Okay, let's take another look at this tiered, multi-node architecture design of Pulsar. What's the big deal? Essentially it's fast, it's low impact, it supports horizontal scaling, and it reduces capex and opex, capital expenditure and operational expenditure, at your company. The broker is stateless, it has built-in load balancing, and scaling is pretty much instantaneous. Disaster recovery is essentially zero impact: any time you need to recover, it can pretty much scale up and down and take care of things itself; for disaster recovery you do need to duplicate certain setups, but because of the nature of the multi-tiered architecture design and the separation of compute and storage, it's actually a lot more flexible and dynamic in that sense. Now let's get into BookKeeper a little bit. BookKeeper is very scalable and is WAL-based, using a write-ahead-log protocol that helps with writing all of these log records and also helps with ordering. It provides fault-tolerant and low-latency storage services, with tunable consistency for message replication: the ensemble size, write quorum, ack quorum, all of these things can be tuned in BookKeeper. As you can see, if you lumped all of this into Pulsar to do itself, it would be overburdening Pulsar.
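Those replication knobs, the ensemble size, write quorum, and ack quorum, can be tuned per namespace through the Pulsar admin API; here is a hedged sketch, assuming an admin endpoint at http://localhost:8080 and the default public/default namespace, neither of which comes from the talk.

    import org.apache.pulsar.client.admin.PulsarAdmin;
    import org.apache.pulsar.common.policies.data.PersistencePolicies;

    public class TunePersistence {
        public static void main(String[] args) throws Exception {
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://localhost:8080")
                    .build();

            // Ensemble of 3 bookies, each entry written to 3 of them, and
            // acknowledged once 2 have persisted it; 0.0 = no mark-delete rate limit.
            admin.namespaces().setPersistence(
                    "public/default",
                    new PersistencePolicies(3, 3, 2, 0.0));

            admin.close();
        }
    }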
That's why Pulsar doesn't concern itself with things like that, with the bookkeeping, because Pulsar is a lot larger; well, I shouldn't say larger, but it has its own fish to fry that are important: handling all of the brokers, all of the messaging guarantees, all of that. So it doesn't overburden itself with having to worry about bookkeeping; it lets BookKeeper do the bookkeeping. As you can see, there's also journaling in BookKeeper: fast writes are guaranteed through the journals, and the electronic ledgers provide segment-centric data persistence. It goes very deep; BookKeeper is a library, a project of its own under Apache, so you can always look it up for more details if you'd like, but for this particular presentation I won't go into all the deep details, not just yet. Okay, so let's look at the capabilities of Apache Pulsar. What problems is it trying to solve? It also solves problems as a bolt-on, because it represents the next generation of enterprise messaging. Think of it as not coming in as a disruptor; well, a disruptor in a sense, but it doesn't really disrupt you, meaning that if I'm on-prem, I just keep going on-prem, I don't worry about it, and essentially pop in Pulsar. It's like a big, giant bolt-on: just nail it in and it can work. You can configure it, obviously, and then you can get your whole thing running as is. For example, in this picture here you've got Kafka, JMS (the Java Message Service), RabbitMQ; all of these are somewhat older messaging systems now, but you have them in your company and you don't want to change them quickly, because it costs a lot of time and money, and you want to keep using them because they contain very important business knowledge that would be hard to migrate. So you keep those and, in the meantime, drop in Pulsar. Pulsar is non-disruptive like this: you just plug it in and it gets to work. Why? Because it is actually a unified solution; it comes in almost like an ambassador, unifying everything. A unified solution for pub/sub, for streaming, for messaging, for queueing, and also for message mediation and enrichment, the transformation, all of these. If you think about this kind of capability, I don't think there is another project or library out there that can support all of it: pub/sub, queueing, streaming, and message mediation and enrichment. And then there are the out-of-the-box capabilities: you can run on-prem or hybrid, and geo-replication is a big thing for Pulsar as well, for messaging where you want replication across geographical areas. It already has a lot of very useful, even user-level, capability built in, and we'll take a look in the next couple of slides. So, geo-replication and multi-region support: you can have multiple regions, maybe even within a geographic area, and it has fine-grained support for each area. Say you are a huge enterprise corporation with different countries and different rules for different regions; basically it's a matter of configuration within Pulsar. And if you get into data lakes, data mining, all of these are also supported, and much, much more.
And this is only the tip of the iceberg; the project still isn't that old, and there are still a lot of things being planned for it. So basically it unifies all events: it's a unifying platform for all events in the enterprise. It's essentially the same kind of picture, but in this particular case you can already have all of this on-prem, and Pulsar will help you do the job with very minimal disruption. Here's another picture: a unified infrastructure with built-in geo-replication. As you can see: multi-cloud, hybrid cloud, multi-region. And if you look at the right-hand side, a lot of large corporations can have systems that are very complicated; it's really like a zoo. In this example you've got almost all the software you can think of: Oracle for databases, traditional RDBMSs, Postgres, MySQL; then you have cloud messaging, Amazon SQS and Confluent; then different types of programs, Python, Golang, Java, JavaScript, all of these; and NoSQL databases, Cassandra for example, MongoDB, and Redis, whatever you want. All of these can coexist. You don't get rid of things, you just keep running, and Pulsar will come in and enable and augment everything; that's the goal, the benefit of Pulsar. If you take a look at it, it gives you this universal upstream connectivity and also universal downstream connectivity. If you look below here, you can also connect to stream analytics and processing like Flink, Spark, and Databricks, and there are data lakes and warehouses like Amazon S3, Snowflake, and Hadoop, for example. And if you have traditional messaging, Java JMS, Kafka, and MQ, it also has a compatibility layer built to talk with them. So it has really positioned itself very well. So how is Pulsar different? It's a next-generation architecture that provides a distributed, tiered architecture: it separates compute from storage, ZooKeeper holds metadata for the cluster, the stateless broker handles producers and consumers, and storage is handled by Apache BookKeeper, which we already talked about a couple of slides back. There's also a rich ecosystem of connectors and clients. There are plenty of connectors; a really common pattern, for example, is to build data pipelines and send data into a sink, which can be Elasticsearch, MongoDB, Hadoop, Cassandra, ClickHouse, Flume, and so on; the possibilities are nearly limitless. Pulsar also features rapid patching with zero downtime, and multi-tenancy: you can have soft isolation via read/write I/O separation, independent storage quotas, and message flow control and throttling mechanisms, or hard isolation via physically separate brokers and bookies for tenants, so it's very flexible in terms of multi-tenancy. There's also IAM, identity and access management: pluggable authentication supporting TLS (the successor to SSL), JWT, Athenz, and Kerberos, plus role-based authorization that provides control at the cluster, tenant, broker, producer, and consumer level. It also provides end-to-end encryption: in-transit TLS encryption and application-managed content encryption, so you can rest assured your data will be safe.
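As a sketch of how the multi-tenancy and geo-replication just described are wired up, here is roughly what the Java admin API calls can look like; the tenant, namespace, role, and cluster names are made up for illustration, this assumes an admin endpoint at http://localhost:8080 with clusters named us-east and eu-west already configured, and the TenantInfo builder shown here is the form used in recent Pulsar client versions.

    import java.util.Set;

    import org.apache.pulsar.client.admin.PulsarAdmin;
    import org.apache.pulsar.common.policies.data.TenantInfo;

    public class TenancyAndGeoReplication {
        public static void main(String[] args) throws Exception {
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://localhost:8080")
                    .build();

            // A tenant for the finance organization, allowed to use two clusters.
            admin.tenants().createTenant("finance",
                    TenantInfo.builder()
                            .adminRoles(Set.of("finance-admin"))
                            .allowedClusters(Set.of("us-east", "eu-west"))
                            .build());

            // A namespace inside the tenant, e.g. for a fraud-detection service.
            admin.namespaces().createNamespace("finance/fraud-detection");

            // Enable geo-replication: messages in this namespace are replicated
            // between the two clusters by Pulsar itself.
            admin.namespaces().setNamespaceReplicationClusters(
                    "finance/fraud-detection", Set.of("us-east", "eu-west"));

            admin.close();
        }
    }

After the last call, messages published to topics in that namespace are replicated between the two clusters with no application code involved.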
So, key differentiator number one: separation between compute and storage, as I've already talked about. That's why scaling can be independent and the two don't interfere with each other; storage is handled by Apache BookKeeper, with segment-centric message storage management and fast, low-impact horizontal scaling capability. The next thing is native geo-replication: you can have hands-off, real-time message replication across data centers, which is really nice, meaning you can have data centers in Europe, in North America, all of these places, and it handles all of the geo-replication for you, helping you meet your data compliance requirements across geographic regions as well. That's a really big plus. Then there's also multi-tenancy. As I talked about, it's like an apartment building: you have independent units, isolated from one another, but all handled by Pulsar. Likewise, in a company, you can have a Pulsar cluster and, within it, separate things out according to function. You can have a tenant for finance, another tenant for marketing, another for product or engineering; within each, different namespaces handle different functionality, as you can see here. You can have a namespace for a microservice that handles message topics and everything; then there's marketing, where you can have namespaces for handling things like campaign management and lead generation, all of this good stuff; and then finance, with fraud detection, for example. All of this is already built in and it's not too complicated: some things are just configuration, and then your system is up and can handle capabilities that you would otherwise have to write yourself. Number four: a flexible message processing model. You can use a pub/sub model, or you can do queueing too. What's interesting are the subscription modes on a topic. You can have an exclusive subscription, in which a consumer and the topic are tied very tightly together; it's exclusively for that consumer. Then there's a failover type of scenario: you can have more than one consumer, but only one is the primary, so if something fails, another consumer steps in to take charge and receive the messages. Then there are shared subscriptions; shared is a very important concept, especially in a cloud-native environment. You can share the subscription, and that matters partly for cost: in the cloud, if you use any resources, you're billed by how many CPU cycles you're using, and with a shared subscription you're actually saving on that cloud-native cost. And then there's a special kind called key_shared, where everything is essentially represented by a key and messages with the same key go to the same consumer. It's a good fit for big use cases as well, where Kafka has challenges.
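Here is a small sketch of the four subscription modes just described, using the Java client; the topic and subscription names are illustrative, and a local broker is assumed.

    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.SubscriptionType;

    public class SubscriptionModes {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            String topic = "persistent://public/default/orders";

            // Exclusive: only this consumer may attach to the subscription.
            Consumer<byte[]> exclusive = client.newConsumer()
                    .topic(topic)
                    .subscriptionName("orders-exclusive")
                    .subscriptionType(SubscriptionType.Exclusive)
                    .subscribe();

            // Failover: several consumers may attach; one is active, the rest stand by.
            Consumer<byte[]> failover = client.newConsumer()
                    .topic(topic)
                    .subscriptionName("orders-failover")
                    .subscriptionType(SubscriptionType.Failover)
                    .subscribe();

            // Shared: queue semantics, messages are spread across all attached consumers.
            Consumer<byte[]> shared = client.newConsumer()
                    .topic(topic)
                    .subscriptionName("orders-shared")
                    .subscriptionType(SubscriptionType.Shared)
                    .subscribe();

            // Key_Shared: like Shared, but same-key messages go to the same consumer.
            Consumer<byte[]> keyShared = client.newConsumer()
                    .topic(topic)
                    .subscriptionName("orders-key-shared")
                    .subscriptionType(SubscriptionType.Key_Shared)
                    .subscribe();

            exclusive.close();
            failover.close();
            shared.close();
            keyShared.close();
            client.close();
        }
    }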
So why Pulsar? I want to talk a bit about why it's better than Kafka. I wouldn't say it's better in every way; sometimes it depends on what your scenario is. If you are in a situation where you don't need all of the fancier geo-replication and a giant setup, then maybe it doesn't apply to you. But if you are interested in reading about it, there's a link here: Confluent has a piece on Kafka versus Pulsar, and there's another comparing Kafka and Apache Pulsar for event streams and exploring the features and myths. So, when should you consider Pulsar? If you need both queues, like RabbitMQ, and stream processing, like Kafka; or you need easy geo-replication; or multi-tenancy is a must and you want to secure access for each of your teams; or you want to persist all of your messages for a long time and you don't want to offload them to another storage system; or performance is critical for you and your benchmarks have shown that Pulsar provides lower latency and higher throughput, so you can count on the performance being quite good; or if you run on-prem and don't have experience setting up Kafka but you have Hadoop experience. These might be reasons why you would want to use Pulsar. Now, the DataStax flavors of Pulsar; I'll go through this really quickly. DataStax is essentially taking Pulsar ten times further: we have added our special sauce, so to speak, on top of everything that's already there, with compatibility with JMS, MQ, and Apache Kafka. We have libraries that help you with legacy systems and with transformation, and we have migration support for you too, so think of us as enabling you to be even more productive: Pulsar meets you where you are. Again, I already mentioned Astra Streaming, which is managed Pulsar; there's Luna Streaming, which is basically the open source distribution with the option of enterprise support; and there's also the pure open source version if you want to go all open source, which is all community driven. Comparing against Kafka, I just want to highlight a few pain points. Kafka doesn't have the separation of compute and storage, so when you start off it's maybe more straightforward: you know how many topics, how many partitions, how many brokers, you plan it all out ahead of time, and it's fine. However, if you need to keep growing your system, it may become more difficult, because what do you do now? You've already defined a number of topics and partitions; you can change that, but it requires a lot more work. And cluster rebalancing can impact the performance of connected producers and consumers. There is a geo-replication mechanism, for example the library called MirrorMaker, but it's not very ideal at this point, and companies like Uber have created their own solutions to overcome these kinds of issues. So overall you can look at Pulsar as an extension, augmenting what Kafka does. Partition-centric versus segment-centric: I think I already showed this to you. It compares with Kafka, where everything is partition-centric and Kafka manages the storage too.
All of Kafka's log segments are replicated in order across brokers. With Pulsar, on the other hand, BookKeeper handles all of this magic for you, the ledgers and the journaling, and it becomes easier when you try to replicate or multiply things out. It's built more for scalability, and for rebalancing: that's another thing, you need to rebalance if you need to grow your system with more clusters and more nodes, and Pulsar actually grows nicely because it already has that built in; it knows how to rebalance all of these topics and how they are partitioned, and things like that. All right, the architecture advantages: compute and storage separation, which we already talked about, and segment-oriented log messaging. So, where to go from here, and let's keep in touch. I think I'm running out of time to do a demo, but I wanted to bring these resources to your attention: pulsar.apache.org, bookkeeper.apache.org, zookeeper.apache.org; you can read up on all of the details there. And at DataStax we offer Astra Streaming, and also Luna Streaming, which is essentially all free unless you want to purchase enterprise support. Again, DataStax is very much about open source; we are a strong supporter, and we have folks working on Cassandra and Pulsar as committers and PMC members. Please follow my Twitch stream every Wednesday at 2:00 p.m. US Central time; these days I'm getting into talking more about event streaming and Pulsar. I've been with DataStax for a little over a month, so I haven't gone as deep yet, but I will. I also have other topics too, like a developers' chat, for example, last week at DevNexus, and things like that. And please join us in the neighborhood: this is all new, we are starting up a website called Apache Pulsar Neighborhood, and there are also meetups on meetup.com, so please follow the Apache Pulsar Neighborhood group over there. And with that, I want to thank you all. Thank you for listening to my talk, and please stay connected with me: join my Discord server, I'll be happy to talk about anything, and follow me on Twitter; my LinkedIn handle is right here as well. With that, thank you very much.
...

Mary Grygleski

Streaming Developer Advocate @ DataStax

Mary Grygleski's LinkedIn account


