Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Today I am talking about how do you scale
            
            
            
              open telemetry collectors using Kafka? To introduce myself
            
            
            
              I am Pranay. I am one of the founders and maintainers at SigNos.
            
            
            
              I have been working as product manager in the past. I was product
            
            
            
              at Microsoft and multiple other startups in
            
            
            
              my pastime. I love reading and taking
            
            
            
              a walk in the nature just to set context.
            
            
            
              Why I'm talking about open telemetry collector so
            
            
            
              as I mentioned, I am maintainer at signals. It is an open source observatory
            
            
            
              platform. We are have been there around for
            
            
            
              three years. We have around 16,000 70,000 GitHub stars,
            
            
            
              4000 plus members of slack community, 130 plus contributors
            
            
            
              and it's open telemetry native single
            
            
            
              pane in glass for auxiliary. So we have support for different
            
            
            
              traces, metrics and logs and you can see all
            
            
            
              of them in a single pane and correlate across them.
            
            
            
              And today I'll be talking about our experience with
            
            
            
              signos cloud and underneath we use open telemetry
            
            
            
              collector and how we scaled it and our experience
            
            
            
              regarding that. So that's where I'm coming from.
            
            
            
              Just to set context for people who are not aware what is
            
            
            
              open telemetry. So open telemetry is a
            
            
            
              CNCF projects, it's a vendor neutral standard for
            
            
            
              sending telemetry data from your applications
            
            
            
              and it supports all the signals.
            
            
            
              So you can send metrics using it, you can send traces
            
            
            
              using it, you can send logs using it.
            
            
            
              And this is sort of becoming now the default standard
            
            
            
              how people do observe. This has helped
            
            
            
              and so why is it important?
            
            
            
              So open telemetry is the second fastest growing project in
            
            
            
              CNCF and this is just after kubernetes, so you can
            
            
            
              understand how popular it is. What it enables is it
            
            
            
              enables sending data to any
            
            
            
              backend. So it standardizes on that elementary protocol
            
            
            
              through which you send data or applications can send data,
            
            
            
              infrastructure can send data to a backend.
            
            
            
              This is now becoming the default standard for instrumentation
            
            
            
              and people can use any backend which supports
            
            
            
              open telemetry for their use cases.
            
            
            
              The big advantage it gives is that you are not users are now not getting
            
            
            
              vendor locked in to a particular ecosystem and hence have
            
            
            
              more flexibility. So this has promoted a
            
            
            
              lot of innovation in the ecosystem. The other key advantage is that
            
            
            
              it's, it has been a unique standard
            
            
            
              in the sense that it supports all the three signals, metrics,
            
            
            
              traces and logs from get go. And also new signals like profiling
            
            
            
              is in progress. And at Signos we are
            
            
            
              natively based on open telemetry. When we started the project in 2021.
            
            
            
              This we took a bet on just
            
            
            
              being natively based on open telemetry. That's the only SDK support.
            
            
            
              And most of the things we do, we try to rely on open telemetry
            
            
            
              as much as possible. So today in our talk, we'll talk about if
            
            
            
              you have used open telemetry collectors or have tried it for setting
            
            
            
              up your ability, how you can scale it and,
            
            
            
              and use it for like huge scale.
            
            
            
              And I'll talk about our context on
            
            
            
              how we leverage is for a signos cloud product.
            
            
            
              And hopefully it will be helpful for you to
            
            
            
              get a sense of where things are. Cool. So now
            
            
            
              we understand what open telemetry is.
            
            
            
              One of the key components of open telemetry is the,
            
            
            
              is the open telemetry collector. You can think of it
            
            
            
              essentially as a pipeline through
            
            
            
              which you can send data, process data and send to
            
            
            
              different destinations, right? So there are three key components
            
            
            
              of open telemetry collector. There are receivers,
            
            
            
              there is a processor, and then there are exporters. So through receivers
            
            
            
              you can receive data from different formats. For example, you have
            
            
            
              host metric receiver, which can receive data from
            
            
            
              machines and infrastructure metrics like
            
            
            
              cpu's memory res. There's a Kafka metric
            
            
            
              siever where you can get metrics from Kafka.
            
            
            
              There are like 90 plus such receivers which enable you to
            
            
            
              receive data from different sources. The next is processors,
            
            
            
              which basically helps you do different type of processing. So you
            
            
            
              can do change attributes. You can do filtering of particular type
            
            
            
              of logs, traces or metrics.
            
            
            
              And then you export it to new destinations. So you
            
            
            
              can export it to destinations like click house.
            
            
            
              You can export it to back end providers like signals.
            
            
            
              Or you can use any other exporter like Kafka
            
            
            
              exporter tools to send data across by
            
            
            
              Kafka. So open telemetry collectors are really
            
            
            
              the key components in open telemetry. And we'll
            
            
            
              focus on how we have used open telemetry collectors
            
            
            
              to, to provide this, our signals cloud service.
            
            
            
              Just to take you a bit deeper into our
            
            
            
              signals cloud for the singleton architecture. So without,
            
            
            
              and this is without Kafka, we are
            
            
            
              both multi tenant and single tenant architecture. In this talk, I'll primarily focus on
            
            
            
              the single tenant architecture because that's where we
            
            
            
              have used Kafka a lot to scale
            
            
            
              this architecture. So imagine
            
            
            
              there's a customer who is using otill collector as an agent and
            
            
            
              sending data to signals backend,
            
            
            
              or they're sending directly data through applications,
            
            
            
              right? So in architecture which
            
            
            
              doesn't involve Kafka, it will, they will send data directly to a load balancer
            
            
            
              and then there will be then directed to the individual
            
            
            
              tenant total collectors. So in this architecture,
            
            
            
              signos tenant has their own hotel collector.
            
            
            
              The load balancer points to like sends data
            
            
            
              from a particular customer to their specific signals hotel
            
            
            
              collector. Right. The problem with this architecture is
            
            
            
              that if the tenants go, go down
            
            
            
              or the dbs in this tenant have issues,
            
            
            
              then the agents start. The hotel collector here starts
            
            
            
              giving five xx that leads to loss
            
            
            
              of data. Also if the tenant has
            
            
            
              to, like if a customer suddenly has spiked
            
            
            
              their ingestion rate, so say they have made
            
            
            
              it ten x it in say few minutes. This leads
            
            
            
              to like if it is directly collected to a single hotel collector,
            
            
            
              maybe the hotel collector doesn't scale as quickly and
            
            
            
              that will lead to loss of data for the tenants.
            
            
            
              Right. So there's always some time which the tenant
            
            
            
              DB takes to scale up and during that time there
            
            
            
              will be loss of data for the tenant.
            
            
            
              So effectively customers may see some data getting dropped,
            
            
            
              which is obviously not a good thing. Right?
            
            
            
              So this was some of the problems which we immediately
            
            
            
              identified with a single tenant architecture.
            
            
            
              It seemed imperative that there should be a queuing system in front of
            
            
            
              the after the gateway or kettle collectors,
            
            
            
              right? So in this example, the,
            
            
            
              here is the load balancer. So the load balancers. So the signals
            
            
            
              customers send data to the signals load balancer.
            
            
            
              And then that talks to a gateway of collectors.
            
            
            
              So this is basically a fleet of urgently scaling
            
            
            
              which received data from the customers.
            
            
            
              We use something called a Kafka exporter. So as I mentioned earlier,
            
            
            
              hotel collector has something called exporter
            
            
            
              and receiver and processor. So in this
            
            
            
              example we are showing hotel
            
            
            
              collectors, gateway hotel collectors, they,
            
            
            
              they receive the data from the.
            
            
            
              And then we use Kafka exporter and the Sotel collector
            
            
            
              to send data to Kafka. So we have Kafka
            
            
            
              setup here. I'll get into more details on what is the setup and
            
            
            
              the configuration for that. And then in the signals
            
            
            
              tenant, which also has auto collector in it, we enable the Kafka
            
            
            
              receiver and get data from Kafka there.
            
            
            
              So as you can see here, sort of,
            
            
            
              even if there is a huge spike in
            
            
            
              load from a particular customer, the Kafka acts
            
            
            
              as the queing system in between to absorb
            
            
            
              that spike. And then as the tenant system,
            
            
            
              hotel collector gets scaled up, it can start consuming
            
            
            
              at a higher rate. So there's no loss
            
            
            
              of uh, packet draws which we had earlier where
            
            
            
              there was no Kafka in place. Yeah. So these are some
            
            
            
              of the use cases where Kafka can be helpful.
            
            
            
              So this enables a highly available ingestion we
            
            
            
              have. So this is our current Kafka setup. We have 6 hours retention
            
            
            
              period, depletion factor of three, and then a ten
            
            
            
              mb max message, right? So Kafka,
            
            
            
              as you know is higher, is we can, we have configured
            
            
            
              it to be highly available. So with a reflection factor
            
            
            
              of three. So Avka acts as a buffer for 6
            
            
            
              hours. It's, it has much higher availability,
            
            
            
              it can handle busty traffic, tenant can
            
            
            
              continue consuming at their rate and
            
            
            
              have some time to scale up as they need. While the Kafka acts as sort
            
            
            
              of the single line of defense against a very high spike
            
            
            
              in loads. These are the two key main factors.
            
            
            
              But one of the advantages, parallel advantage to this is that
            
            
            
              this can also enable lot of additional processing,
            
            
            
              which we can do at Kafka. For example,
            
            
            
              especially in the case of database sampling for traces.
            
            
            
              Sometimes people want to filter traces
            
            
            
              by trace id and accumulate all traces at one place and
            
            
            
              then do some processing on it. For example, if you
            
            
            
              want to reject or like not store traces which has
            
            
            
              particular attribute in it. For example say if you want to reject all
            
            
            
              traces which have health check endpoints, right? Which essentially don't add
            
            
            
              value in case, if there are just hotel collectors
            
            
            
              you would have to challenge. You'll have a challenge of like having
            
            
            
              all hotel collectors in the case where they're like multiple
            
            
            
              order collectors to like all spans come to the same hotel collector,
            
            
            
              right? Because auto collector by natal is horizontally
            
            
            
              scalable, so you don't really control which hotel collector which span
            
            
            
              will go to and they are stateless. In Kafka,
            
            
            
              this problem can be solved much easily because you can use
            
            
            
              trace trace id as a partition key and that would
            
            
            
              enable all trace ids to go to a particular and
            
            
            
              hence you can do all those trace tail based
            
            
            
              sampling much easily there. So having a
            
            
            
              queue system in between helps you doing a
            
            
            
              lot of additional processing, especially if you want to do tail based sample
            
            
            
              tail sampling at the auto collector level.
            
            
            
              Much easier, right? So as I said, this is
            
            
            
              our Kafka setup. So just to give
            
            
            
              an example, these are some of the typical things
            
            
            
              we monitor. So we like if you have set up Kafka,
            
            
            
              you will also need to monitor like how many records are getting
            
            
            
              produced, what's the,
            
            
            
              for each tenant, what's the sort of difference
            
            
            
              how much records are getting consumed? So the difference between
            
            
            
              the number of records produce and consume basically tells you like what
            
            
            
              you need to do. Is that a lag in between if
            
            
            
              you need to scale your system or not, right?
            
            
            
              So if, so, when you set up our system
            
            
            
              here with Kafka we actively monitor
            
            
            
              all this, all these metrics and
            
            
            
              as you can see we use signals to monitor signals so this is a case
            
            
            
              of the monitor getting monitored itself and
            
            
            
              we use signals internally heavily to monitor all our SAS
            
            
            
              customers and use cases so you can monitor
            
            
            
              all different types of kafka queues here
            
            
            
              how many records are being consumed, how many records are being generated
            
            
            
              monitoring consumer lag is very important so this gives you
            
            
            
              a sense of like hey like how much ingestion is
            
            
            
              there and how much is the individual hotel collector is
            
            
            
              consuming right? For example in this case as you can see like for
            
            
            
              example a particular customer starts sending in data
            
            
            
              at a very high rate and this signal,
            
            
            
              this like the tenant for that is consuming a particular rate, right? So if
            
            
            
              the initial reads rate suddenly spikes then
            
            
            
              this the consumer lag here will
            
            
            
              increase and that we monitor actively to
            
            
            
              throw alerts and so the act of the signal let hey like
            
            
            
              maybe the signals tenant needs to be automatically scaled
            
            
            
              and like we need to more add more resources there so this
            
            
            
              scaling happens automatically but consumer lag is like one of the pointers
            
            
            
              which helps you decide what to and when to
            
            
            
              scale the tenants, right? So we monitor
            
            
            
              all this active like consumer lag actively
            
            
            
              and what are the number of kafka
            
            
            
              cluster. So we use red panda internally to monitor all this so how many
            
            
            
              are the red panda brokers? What is the consumer
            
            
            
              lag which is there in between? One of the key factor which
            
            
            
              we use for scaling is to scale
            
            
            
              based on consumer lag so how many partitions
            
            
            
              to increase or not and then this can also be used
            
            
            
              as a metric to scale up your consumer group,
            
            
            
              right? So you and tenant so if you see the number of
            
            
            
              like the consumer lag is increasing a lot we write scripts
            
            
            
              to automatically scale this up basically
            
            
            
              enable higher ingestion like higher consumptions from the hotel
            
            
            
              collector tenant tenant altar collectors and that
            
            
            
              that reduces the consumer lag and basically the systems get stable
            
            
            
              again so kafka acts as a buffer in between to
            
            
            
              handle workloads in the scenario if there was no kafka
            
            
            
              as as we had discussed earlier then
            
            
            
              suddenly this hotel collector will get overlapped and there
            
            
            
              will and if it doesn't scale at the same rate then
            
            
            
              it will start dropping packets if Kafka is present then that
            
            
            
              doesn't happen so consumer lag is
            
            
            
              an important thing to monitor that
            
            
            
              helps you scale kafka also one thing
            
            
            
              which accumulated monitor is consumer latency so what is the
            
            
            
              producer consumer latency so here if you see we have plotted
            
            
            
              how much time it takes for the producer to produce and get into
            
            
            
              Kafka. And what is the latency which like the consumer has.
            
            
            
              And if this latency increases by a particular amount we
            
            
            
              throw out alerts and signals and that basically indicates that hey, there is
            
            
            
              something wrong and then maybe some steps needs to be taken to fix
            
            
            
              it, right? So till now the
            
            
            
              kafka base architecture is working quite well for us. It enables very
            
            
            
              fast ingestion, it can handle spikes
            
            
            
              in loads from workload customers and data
            
            
            
              is retained for 6 hours and that works.
            
            
            
              We even get a huge compression factor of ten to 15 because
            
            
            
              hotel collector by default works in a batch model. So it
            
            
            
              sends data in batches and before ingestion into Kafka
            
            
            
              we are able to get a huge compression before sending that.
            
            
            
              So in that sense also it works quite well and
            
            
            
              it handles spikes, as I mentioned earlier, also very well.
            
            
            
              There are few areas where I think there could be improvements
            
            
            
              which we are working on currently. So can we make this
            
            
            
              automatically increase of based on just on the
            
            
            
              scale of ingestion of a topic, if a partition gets stuck
            
            
            
              for a tenant total collector, then can we
            
            
            
              use some methods like dead letter queue to drop
            
            
            
              after a few retries so that it doesn't get stuck in a permanent
            
            
            
              failure. Also making the whole tenant
            
            
            
              hotel collector which is like Kafka receiver to processors to exporter
            
            
            
              a synchronous module so that consumer commits
            
            
            
              an offset only after the message is successfully returned to a DB.
            
            
            
              So there's some guarantee on when the
            
            
            
              message is being written and the other
            
            
            
              key pieces like hey, can we make this exactly one delivery?
            
            
            
              Which as you know in our queuing processes are not easy
            
            
            
              to do right? So these are some of the improvements which we
            
            
            
              foresee in the future and which you're working on to solve. But overall,
            
            
            
              just adding Kafka on the Autel character has been
            
            
            
              a step improvement for us and hopefully for
            
            
            
              other people, other teams which are running the set scale,
            
            
            
              they can take this as a guide.
            
            
            
              So that's all I had for my talk.
            
            
            
              Just want to give a shout out for signals. So we are actively
            
            
            
              growing community in using open
            
            
            
              telemetry and we are a complete authority stack so you
            
            
            
              can check out all signals repo if you try
            
            
            
              it out create feel free to create an issue or participate
            
            
            
              in our slack community. As I mentioned, we are quite
            
            
            
              active slack community where you can come and get your questions answered and
            
            
            
              if you have any follow on thoughts with me, feel free to email me
            
            
            
              or that's all from my side.
            
            
            
              Looking forward to hear more from you and feel free to reach out to us
            
            
            
              if you have any feedback. Thank you.