Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, good evening.
My name is malicious.
I'm a senior workplace engineer.
Today I'll be speaking on "OpenTelemetry is not a Silver Bullet: lessons learned from the field."
Before I proceed, I would like to talk about the need for OpenTelemetry.
OpenTelemetry is needed in observability because it is used to collect telemetry data, which helps you understand the behavior of your system or your servers, especially in distributed systems.
Telemetry data is very important because it gives you an overview of what your system's performance is like, and it gives you the opportunity to take proactive actions to make sure your systems are running as expected.
Now we'll be speaking on the vision: why OpenTelemetry was created.
The first vision behind it was a unified standard.
What this means is that there have been a lot of promises about getting a single standard for traces, logs, and metrics across distributed architectures, regardless of where your system sits, whether in the cloud, on premises, or outside of the cloud.
Telemetry has been very important in these environments for gathering this data, which gives us insight into what is going on within our environment.
The reason for this was to make sure there is unified monitoring across our environment.
The second vision has to do with vendor freedom.
There have been a lot of promises around lowering risk and around how we standardize the data we collect from these environments.
These promises come from different vendors telling us how we can use this tooling, these systems, or these platforms to achieve this or that.
But at the end of the day, it boils down to what we have in place, what we have in our team, and how we go through those processes to make sure we carry out all the activities we need to keep our systems efficient and performing a hundred percent, without any resource constraints.
Now, on the other hand, we have also been promised growing adoption. There has been a lot of adoption, with clouds adopting different tools and systems, as we can see in cloud offerings like Microsoft Azure, AWS, and Google Cloud.
So there have been a lot of promises around these areas as well.
And if you notice, there are some established platforms; for instance, we have Datadog and we have LogicMonitor.
LogicMonitor is not really a big suite for collecting data, but we can use these collectors to try to collect data and to understand what is going on within the environment.
There have also been a lot of open-source projects around this, which seems promising, and it is therefore very important that organizations adopt these growing tools, systems, and platforms. Rather than going for a platform, it's best to use these tooling systems, as they give us what we need to make sure our environments are running as we expect.
Now, in reality, in the real sense of the whole thing, we are talking about the common pain points we have encountered in our various fields and experiences, be it as a DevOps engineer, platform engineer, software engineer, or an SRE, depending on what your position is.
There are challenges we encounter across these disciplines.
The first thing I would like to speak about is the documentation challenges.
There have been a lot of documentation errors and deficiencies: making use of obsolete documentation that needs to be updated, from vendors or from tooling like this, which could otherwise help us propel our collection of data using our collectors.
These documentation challenges have brought a lot of gaps and redundancies within teams, organizations, and businesses, because no one is checking all of these formal things; people don't really take them seriously, and this has created a lot of issues.
It is therefore very important to make sure your documentation is compliant and is treated as a regular activity rather than something someone picks up once in a while.
So it's very important that everyone ensures the documentation is updated and everyone is carried along, to avoid any issues that could impact your environment.
Another issue we also face is the boilerplate burden.
This has to do with the software development kits we use to write our code, be it Java or Python, depending on what your organization uses and the language your team is using.
It is very important that the software development kits, tools, or IDEs we use integrate with the platform we use, and we also need to ensure we are not using outdated libraries or outdated code that could cause an issue in our environment or even break it.
So it is very important that all of these are checked, and we also need to ensure that every developer and every DevOps engineer in the team is using the standard code methodology to avoid problems.
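To make that boilerplate burden concrete, here is a minimal sketch of manual tracing setup with the OpenTelemetry Python SDK. It assumes the opentelemetry-sdk and OTLP gRPC exporter packages are installed and that a collector is listening on the default local OTLP endpoint; the service and span names are purely illustrative.

```python
# Minimal manual OpenTelemetry tracing setup (sketch, not a full configuration).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service so backends can group its telemetry.
resource = Resource.create({"service.name": "checkout-service"})  # hypothetical name

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order"):
    pass  # application work would happen here
```

Even this small amount of setup has to be repeated, correctly, in every service and every language the team uses, which is exactly where the boilerplate burden shows up.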
Now, lastly, we are going to speak about version conflicts.
When it comes to version conflicts, there have been a lot of incompatibilities, which creates constant friction amongst teams, amongst engineers, and across organizations.
It causes a whole lot of issues, especially when trying to ingest data, collect data, trace, or get logs to check what the system is doing.
We found that a lot of issues come down to version conflicts: incompatibilities, or versions that simply are not compatible with our system.
So it is very important that we make sure all of these are checked, and that the tooling systems and the versions we use to collect data within our environment are compatible with our systems.
Now, moving forward, we'll be speaking about the scaling headaches: the collector and infrastructure challenges.
The first thing I'm going to speak about is throughput.
The OpenTelemetry collector faces performance issues as the environment grows; the processing pipeline can't keep up with the data volume.
This is evident in a situation where you have an environment loaded with a whole lot of data, with a whole lot of storage, and nobody is checking it.
There are no collectors monitoring the environment, no collector monitoring all the storage, and nobody knows when the storage gets filled up.
So it causes a whole lot of problems.
To give an example from the field: we had a server environment that we used for our support server and also for our data warehousing.
We noticed that the cluster storage within this environment had filled up, and because there was no collector monitoring the environment, the storage filled up and the disk for the server also filled up.
This led to the crashing of that environment, and it cost a huge amount of money for both our business and our organization.
So it is essential to make sure there are collectors deployed across the environment, taking real-time performance data, so that the junior SREs looking after the environment are proactively aware of what is going on, and in case of any issues there can be scaling up, or adding of more storage within the environment, to keep the environment up and running.
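As a sketch of what that proactive monitoring could look like, assuming the opentelemetry-sdk, the OTLP gRPC exporter, and the psutil package: expose filesystem usage as an observable gauge so a filling disk, like the one in the incident above, shows up in dashboards and alerts before the environment goes down. The metric name, mount point, and meter name are illustrative choices, not a standard.

```python
# Report filesystem usage as an OpenTelemetry observable gauge (sketch).
import psutil
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

def disk_usage_callback(options):
    # Report percent used for the root filesystem; add other mounts as needed.
    usage = psutil.disk_usage("/")
    yield metrics.Observation(usage.percent, {"mountpoint": "/"})

reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("storage-monitor")  # hypothetical instrumentation name
meter.create_observable_gauge(
    "system.filesystem.usage_percent",
    callbacks=[disk_usage_callback],
    unit="%",
    description="Filesystem usage for key mounts",
)
```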
Now, another one I would like us to speak about is resource hunger.
It's very important to make sure we are not using burstable servers to deploy our collectors.
What I have noticed is that burstable servers are not really meant for demanding environments that consume a whole lot of data, because burstable CPUs constrain the processor and much of the memory and RAM, and this causes a whole lot of issues.
So it is very important to make sure you're using good specs for your environment; burstable CPUs are mainly meant for development and testing environments.
It's important that businesses and organizations understand that they need something higher, for instance an enterprise tier, or something that gives them at least a dedicated CPU or even more, depending on what the business requires.
Burstable CPU servers are not recommended, especially for collectors that monitor huge environments and take in real-time data.
So it's important that organizations scale up, and also that there is good monitoring of this environment.
Another one I'll be speaking about is the data reliability risk.
Without careful architecture, data gets dropped or delayed, and single points of failure emerge in critical telemetry paths.
What this means is that while designing this environment, we must make sure there are no gaps where we can lose data, and we must make sure the tooling we are using for this environment is compatible and appropriate, so that we can get the monitoring, know what is happening in our environment, understand the performance, and also scale up where needed.
This helps keep the environment up and running, and it puts us in a better position, one that will not cost us issues in the future or cost us money.
So it's important that while we are designing our infrastructure or our environment, we make sure all these gaps are covered, to avoid any unforeseen circumstances.
Now I'll be speaking on the gaps between promise and practice.
There have been a whole lot of promises and practices, but the question is: are these promises delivering what we want? Are we practicing what we preach?
I'm going to start with the storage and visualization gaps.
OpenTelemetry defines the data pipeline, but not end-user analytics.
There is a gap between what OpenTelemetry gives us and what we can get out of it.
It is true that OpenTelemetry helps us get real-time data and observability, which gives us a better overview of what is going on in the environment.
But what we lack is how this helps a single user. Are there any dashboards we can use?
This is where we are lacking, because we need to make sure there are dashboards for end users, which they can use to understand what is going on within their own work. So it's very important that all of this is put in check and done properly.
Now, secondly, we need to speak about library polyglot parity.
We are talking about language and software development kit integration quality, which varies widely.
This has to do with the type of languages we are using to ingest and collect our data, whether Java or JavaScript.
So it's very important that we make sure we are using the programming languages that best suit our needs and our business services as well.
Lastly, team education challenges.
Onboarding engineers requires significant investment. This is very important because it is easy to onboard engineers to start working on these tooling systems: collecting data and logs, ingesting them, and trying to triage what is going on in the server.
But it's very important that while we onboard engineers, we provide them with the necessary training that gives them the knowledge to excel in the position, either as an SRE or as a monitoring specialist.
It is very important that engineers are not only onboarded, but also understand why they are triaging a specific environment, what the data is used for, how they can put it into a report and explain to a wider audience what is going on, and how they are working proactively to make sure the data they are getting is being used as expected.
So it is therefore important to invest in training engineers, to make sure they are well equipped and ready to carry out this task.
Here we'll be speaking about case studies and implementation pitfalls.
First of all, we will talk about collector failures, then we talk about version drift, and lastly we talk about the monitoring blind spots.
On collector failures: leading software-as-a-service providers face a whole lot of collector crashes under peak loads, and monitoring gaps occur during critical operations.
This is evident with some software providers who ship a whole lot of software they know is not really compatible with the environment, or where they know that what you are using within your environment could negatively impact the software.
It is therefore important to make sure you're using the best software, software that works as expected and takes in the data in real time, so we can analyze it and make informed decisions.
This matters because when a collector fails, it can lead to bigger issues, such as bringing the server down or affecting other services that are very important or critical to your business.
So it's imperative that we make sure we are investing in the right software to collect monitoring data within our environment.
The second one we'll be speaking about is version drift.
It is true that weeks can be spent triaging issues where, say, a library update didn't match the company code base.
This is where we need to ensure there are no compatibility issues within our environment, depending on the type of version we are using for getting our logs.
We need to make sure the version we are using matches the vendor's requirements or the tooling system's requirements.
It is very important that we are not using something that is not meant for the software, because it might work at the first instance, but it won't give you the right result that would help you succeed in triaging the required logs you need to proactively make your environment better.
Now, lastly, the monitoring blind spots. A lot of teams go in, just set up the monitoring, and then go to bed, or they just feel like, oh, we have done this, it is monitored.
What happens is that most of these engineers or companies follow an out-of-the-box way of monitoring server environments without taking the necessary actions, which have to do with testing it: testing whether the monitoring is working, checking the dashboards, triaging, and taking logs in real time to understand what is going on, just to make sure everything works as expected before saying yes, we have put this monitoring in place.
I think it is very important to carry out some sort of testing before closing off and saying, yeah, we are done with this.
Because what happens most of the time is that engineers are told to go and set up the monitoring, and then they just leave it; they don't even try checking the methods they are using, whether the schemas are updated, whether they need to bring in a new version, a new code version, or something else that would help make sure their monitoring is working as expected.
So it's very important to make sure there is some sort of testing and confirmation before handing off to other tasks.
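Here is a sketch of one way to "test the monitoring" before sign-off, assuming pytest and the opentelemetry-sdk in-memory exporter: capture spans locally and assert that the instrumentation actually emits what the dashboards expect. The span and attribute names are illustrative.

```python
# Verify instrumentation emits the expected span before handing off (sketch).
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_order_processing_emits_span():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))

    tracer = provider.get_tracer("checkout-tests")  # hypothetical instrumentation name
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", "test-123")

    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    assert spans[0].name == "process-order"
    assert spans[0].attributes["order.id"] == "test-123"
```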
So, the lessons I have learned, and what really works from my own understanding. First, I would say start small.
Why I would say start small is that it is very important to target critical systems and limit the initial scope before rolling out widely.
Two is custom documentation. Investing in internal training and documentation tailored to your specific tech stack, depending on the tools you're using to collect your data, is very important.
Make sure you understand the tooling, make sure everybody is involved, and make sure the knowledge base is always kept up to date with the latest information, the latest steps, and the latest processes, so that everybody is part of this as a culture, not just as a team: something that is passed on from the first line all the way up to the highest person.
So it is important to put internal documentation in place, not just for the technical team but for the whole range, so that anyone reading that documentation understands what is being done and what is being said.
Now, another thing I would like to speak about is incremental expansion.
I understand that sometimes engineers might want to add a whole lot of servers at once, maybe to speed up their work or to make sure the monitoring is going as they expected.
I think it's very important that systems are not added all at once but gradually, as the team grows and as everyone gets familiar with what he or she needs to do.
It is also very important to add a system only based on the capacity of your resource environment.
And lastly is the regular review.
This is where disciplined monitoring of schemas, resources, and integration points comes in.
It is very important that we regularly review our monitoring.
We set out time, and it could be allocated to different engineers at specific times, just to know what is being monitored and to make sure that the data and the performance thresholds are all being kept as the team expects.
So it is very important that these regular checks are carried out on a schedule rather than once in a while; it could be done maybe once a week or twice a week, depending on the availability of the engineers.
It is very important that this is carried out and that everyone is carried along while this is being done as well.
Now, these are the realistic expectations and practical recommendations.
First: toolkit, not platform.
I think it is very important that we understand that OpenTelemetry provides powerful components, not a finished observability solution.
So it is important that we focus and invest more on the tooling rather than on the platform.
What I mean by "rather than on the platform" is this: you get told, oh, we need to buy Microsoft Azure, you need to buy an AWS subscription, depending on the needs; oh, we have different types of monitoring tools that can help you scale your business and get what you want or get your data.
I don't think that is the best way to go about it, because at the end of the day you end up spending more money on different resources.
So I think it's best to go for one tool that is solely built for collecting traces and gathering metrics from your telemetry data, maybe such as Datadog; this is just an example.
So I think it's better that organizations and teams go for specific tools rather than going for a platform.
Secondly, ongoing investment.
I would like to read what this headline says: success requires engineering resources, organizational alignment, and an understanding of system risk.
How I would explain this is that success requires resources, which means that everyone needs to be trained.
Engineers need to be trained; they need to have the right resources and the right knowledge, so that there is no knowledge gap.
This also has to align with the organization: what the organization wants to get from these engineers and what the expectations are, and at the same time prioritizing the critical systems.
The knowledge base and the tooling system need to be prioritized within the critical systems, to avoid any risk that could cost the business or the services.
And while organizations are investing in the engineers, it is important to make sure they are also investing in the infrastructure, that there is always an infrastructure review, and that there is always an upgrade: following up with vendors, making sure the server systems run on the latest version.
I think it's important to put investment into this, because this would lead to fruitful success both for the business and for the engineers, and the systems will be happy if the right resources are being provided.
The last one will be: build complete solutions.
As much as there are a whole lot of benefits we can get out of OpenTelemetry, I think at the same time it's important to build robust alerting and visualization layers around OpenTelemetry, because OpenTelemetry is just a foundation. If you have robust alerting and visualization layers and other tooling that can help make the work easier, I think it's very important that these are considered, because this helps us actually get the results we need.
So I would say that businesses, users, and engineers need to put their attention on this rather than just using one tool system that does everything.
I think it would make sense if they spread their hands to other tooling systems, adding to what they have, as a full, complete solution.
Conclusively, I would say that OpenTelemetry unlocks control of observability without eliminating complexity, and this is where the flexibility comes in.
Also, teams must treat observability as ongoing work and not a one-time setup, because there needs to be continuous engineering across teams, across the business, across the board, across the managers and the stakeholders; all of them need to be involved, because they need to be told, or be aware of, what is going on in the environment.
So I think OpenTelemetry offers us good value, and all of this needs to be put in place.
And lastly, the pragmatic benefits.
This is where a pragmatic, practical approach comes in: actually realizing real-world value beyond the hype, and making sure everything that needs to be done is being done.
So I think it's very important that we make sure there is accurate technical documentation, that engineers are being trained, and that there is good investment in the tooling systems we use to collect our data and to proactively keep our environment in a good state.
So I would say that OpenTelemetry is a great asset to use; it brings a whole lot of benefits and a whole lot of true value to businesses.
So businesses need to invest not only in platforms, but especially in the tooling systems that work best for their services.
Thank you.
It was nice speaking to you all.
Bye.