Conf42 Cloud Native 2022 - Online

OpenTelemetry and Epsagon - A Love Story In Three Acts

Abstract

Epsagon developers use OpenTelemetry extensively to create a sustainable observability product. But it wasn’t always like this. This talk will share the story of Epsagon adopting OpenTelemetry into its systems, the mistakes that were made in the process, how it became a part of the OpenTelemetry community, and how it all came together with Epsagon being acquired by Cisco.

This talk will cover:

  • Recent history of observability with an emphasis on OpenTelemetry
  • The different paths to use open source projects in general, and OpenTelemetry in particular, to create valuable products for your customers
  • How to become a part of OpenTelemetry
  • Pitfalls to avoid when using OpenTelemetry (and open source in general)

Summary

  • Yosef Arbiv: Today we are a part of Cisco and we are building a new product that supports OpenTelemetry natively. He explains how Epsagon got from where it started to where it is today, and the mistakes made along the way. If you want to use open source projects as part of your solution, this talk is for you.
  • We managed to create a good solution for customers using serverless frameworks, but the market was too small for us and we couldn't build a big business on it. Forks are really hard to maintain. There are much better ways to use open source than forks. Growing the tech debt without control over it can be a problem.
  • Epsagon joined the OpenTelemetry community and joined Cisco. Joining the community was a great experience. Being a part of a growing open source community brings lots of value to the developers in my team. In the future, we hope to be a significant part of the OpenTelemetry community.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to my talk, OpenTelemetry and Epsagon: A Love Story in Three Acts. When we started Epsagon four years ago, this was the team, and we used proprietary SDKs and a proprietary protocol. Today we are a part of Cisco and we are building a new product that supports OpenTelemetry natively. We contribute code to OpenTelemetry repositories on a day-to-day basis. If you want to use open source projects as a part of your solution, this talk is for you. I will explain how we got from where we started to where we are today, the mistakes that we made along the way, and what you can learn from them.

First, a little bit about myself. I'm Yosef Arbiv. I am married to Abi and I'm the father of three adorable kids. I'm the team leader of the SDK team, which builds the libraries and SDKs that our customers use, in the Cisco ET&I group.

Now a little bit about Epsagon, so you can understand how we fit in. Epsagon is a solution for customers who build their backend with lots of different services and frameworks in the cloud. As you can see, in such cases it can be very hard to keep track of what you have and who talks to whom. This is where Epsagon comes to help. Our customers install the SDKs as a part of their code. They don't need to change anything in their code; they just need to import our SDKs, and then they can log in to our website and see the traces and graphs of their product: who talks to whom, where they have failures, and where the errors are. They can debug and troubleshoot their systems. In order to generate these graphs, we need to send traces from the code, and this is where my team comes in. We build the SDKs that instrument the code, create the traces, and send them to the backend.

So first, where we started. When we started, the serverless market was new and trending, and we decided to aim for customers using serverless solutions. Back then there was no industry standard for distributed traces.
Most of the customers were using logs or non-distributed traces. We looked at the possibilities we had and considered using OpenTracing, but we had a couple of problems with it. First, OpenTracing was backed by one company, which was a competitor of ours, so we were a little bit afraid of using it. Another problem was that OpenTracing didn't have automatic tracing back then. We wanted the experience for our customers to be as smooth as possible, with minimal code changes. So we decided to create our own libraries that would do automatic instrumentation and create traces automatically.

At first we considered making those packages closed source and sending them to our customers to use. But soon enough we discovered that this was not really a possibility. Our customers didn't want to install closed source packages and add them to their sources. They wanted the code to be open so they could see it, fix bugs, and so on. So we decided to open source our libraries and publish them. We also hoped to create a little community around them, where customers could fix bugs, add new instrumentations, and so on.

So this was the first phase, when we started Epsagon. What did we learn from this phase? First, about product defensibility: you need to think about which part of your product you want to be closed and which part should be open. Focusing the defensibility on the wrong part can be problematic. Another thing we learned is that building an open source community can be really hard and requires a lot of energy and resources. It is much easier to join a community than to create one on your own.

And this brings us to the second act, the standardization of the market. We managed to create a good solution for customers using serverless frameworks, and we were very popular in this market, but this market was too small for us and we couldn't build a big business on it. So we decided to move to other fields, such as Kubernetes clusters.
But when we looked into it, it turned out that building a Java Kubernetes agent on our own was too complex. So we looked into OpenTracing again, and it was more mature by then. So we decided to build our agent on top of it. We took some code from the OpenTracing libraries, added the code that was needed for our solution, and made some changes, such as adding the tracing protocol that we needed. This way we could build a successful Java agent for Kubernetes clusters that was based on OpenTracing but was compatible with the Epsagon backend.

Shortly afterwards, OpenTelemetry was announced. OpenTelemetry was based on OpenTracing and OpenCensus, and it became very popular pretty soon. So we started to build new libraries based on OpenTelemetry code, in the same way we did with OpenTracing. We took the libraries and the code from OpenTelemetry and changed them a bit. We added the unique functionality that we needed for Epsagon that was not included in the OpenTelemetry libraries. In this way we created our own libraries as forks, actually, of OpenTelemetry. This let us create new libraries very fast. So we created more and more libraries, but maintaining them became a headache. It was really hard to maintain the libraries when the community kept moving forward, changing the code, and adding new functionality. It was very hard for us to keep up with the community.

Which brings me to the lessons learned. First, forks are really hard to maintain, because you can't really keep updating your code with the changes and additions that the community makes. So I really suggest not using forks when possible. There are much better ways to use open source than forks. The second one is tech debt. We created a lot of tech debt when we moved forward like this. We managed to add new libraries and new functionality to our product, but as we did it, we created increasing tech debt.
In our case, eventually this was not a problem, as you will see in a minute, but in other cases it can be really problematic. So this is something that should be considered. Again, in some cases it can actually be good to create tech debt, because you keep growing your product, you move forward, and you add new functionality and new features. So it can be great. But growing the tech debt without control over it can be a problem.

And these problems brought us to the third phase: joining the OpenTelemetry community and joining Cisco. As we had more and more forks, the overhead became too big, and we understood that this was not really scalable and that we couldn't move on like this. We also had more customers talking about OpenTelemetry, and customers that were using OpenTelemetry in their code. They wanted to see OpenTelemetry traces together with Epsagon traces in the Epsagon backend.

To answer this need, we decided to run a small experiment with a Java agent. We built a new agent that was based on OpenTelemetry not as a fork but as a distribution of OpenTelemetry, meaning that we used OpenTelemetry as a package and created more functionality on top of it. For the Epsagon backend, we needed to collect more data that OpenTelemetry was not collecting. So we added this as an extension to OpenTelemetry, but we kept the OpenTelemetry trace structure. This experiment was really successful. We were able to build, very fast, an agent that was built on top of OpenTelemetry without forking the OpenTelemetry code, so updating and maintaining it was much easier. In addition, we were using the OpenTelemetry trace structure, which means that our backend was now able to support OpenTelemetry-based traces and not only Epsagon traces. So we were friendlier to the community.

Shortly after this successful experiment, just when we were about to create more libraries in this structure, it was announced that Cisco was acquiring Epsagon.
For us, it meant that we would stop working on the Epsagon product and start working on a new product, a full-stack observability product. Together with Cisco groups, we decided that our new product should support OpenTelemetry natively, meaning that we will be able to provide value for customers using only OpenTelemetry without Epsagon libraries, and to add more value for customers who are using Epsagon libraries. We also decided that our libraries will be based on OpenTelemetry as a distribution. This way we can create libraries fast and also maintain them as we move on.

As we moved into this phase, we also joined the community, meaning that we started to contribute code, add new features, and fix bugs in OpenTelemetry projects. Joining the OpenTelemetry community was a great experience. First, we grouped with the AppDynamics team and other teams at Cisco who were already contributing to and working with OpenTelemetry. They had a lot of experience with OpenTelemetry code and we learned a lot from them. We met maintainers of OpenTelemetry and worked together with them, learning from them, asking them questions, and understanding how the community works and what we need to do in order to fit in.

I can also say that being a part of the OpenTelemetry community is great for the developers in my team. We love to be a part of something bigger and to be able to contribute back to the community. Being a part of a growing open source community brings lots of value to the developers in my team, and this is very important as well.

In the future, we hope to be a significant part of the OpenTelemetry community. We aim to contribute as much as possible from our distributions back to the community. This way, we want OpenTelemetry to be a major part of our full-stack observability solution for our customers at Cisco. Thank you for joining me.
If you have any questions, feel free to reach out on Twitter, LinkedIn, or Discord. Thank you and see you there.
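
As an illustration of the fork-versus-distribution point made in the talk, here is a minimal Python sketch. It is not Epsagon's code and it does not use the real OpenTelemetry API: `UpstreamTracer`, `Span`, `add_on_start_hook`, `add_vendor_metadata`, and the `vendor.app_name` attribute are all hypothetical names invented for this example. The idea it tries to show is that a distribution imports the upstream package unchanged and attaches vendor-specific behavior through an extension point, so the span structure stays the upstream one and upgrades do not require merging a fork.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


# Stand-in for the upstream package (hypothetical, NOT the OpenTelemetry API).
# In a distribution this code is consumed as a dependency and never edited.
@dataclass
class Span:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


class UpstreamTracer:
    """Plays the role of the upstream tracer with a public extension point."""

    def __init__(self) -> None:
        self._on_start_hooks: List[Callable[[Span], None]] = []
        self.spans: List[Span] = []

    def add_on_start_hook(self, hook: Callable[[Span], None]) -> None:
        # Extension point: callers register extra behavior without patching us.
        self._on_start_hooks.append(hook)

    def start_span(self, name: str,
                   attributes: Optional[Dict[str, str]] = None) -> Span:
        span = Span(name, dict(attributes or {}))
        for hook in self._on_start_hooks:
            hook(span)
        self.spans.append(span)
        return span


# The "distribution": vendor code layered on top of the unmodified upstream.
def add_vendor_metadata(span: Span) -> None:
    # Assumed example of extra data a vendor backend might want on every span.
    span.attributes.setdefault("vendor.app_name", "checkout-service")


def create_vendor_tracer() -> UpstreamTracer:
    tracer = UpstreamTracer()
    tracer.add_on_start_hook(add_vendor_metadata)
    return tracer


if __name__ == "__main__":
    tracer = create_vendor_tracer()
    span = tracer.start_span("GET /cart", {"http.method": "GET"})
    print(span.name, span.attributes)
```

A fork would instead copy `UpstreamTracer` and edit `start_span` directly, which is the maintenance burden described in the second act: every upstream change then has to be merged by hand.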

Yosef Arbiv

R&D Team Leader @ Cisco
