Conf42 Observability 2023 - Online

Hacking OpenTelemetry: manipulating the observability pillars


Abstract

Thanks to OpenTelemetry, you can welcome data from platforms and applications that differ in logic and in how they work. Become agnostic from every type of source thanks to the collection, manipulation and re-transmission capabilities of the OTel Collector.

Summary

  • Hacking OpenTelemetry is a session on OpenTelemetry. We will mainly focus on distributed traces, one of the three observability pillars, and on transforming the collected data, with a short demonstration.
  • OpenTelemetry has become the de facto standard for observability. Traces are able to correlate user behavior with the last query performed on the DB. The OpenTelemetry Collector is one of the most important artifacts in the project. These kinds of solutions are important to understand where the market is going.
  • A trace is composed of spans, which act as the unit of measure in this field. Each span represents a step in the request timeline. Can we make a system observable even when it cannot be instrumented? Fortunately, the answer is yes: we use the OTel Collector for that and show how it can be manipulated.
  • OpenTelemetry allows you to define pipelines and move data wherever you prefer, becoming agnostic from every type of data source. For any question about this session or our offer, feel free to reach us through socials or via email.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi all and welcome to this Conf42 session called Hacking OpenTelemetry. My name is Andrea Caretta and I am a senior consultant for Liquid Reply IT, and I'm here with my colleague Alberto Gastaldello, an observability expert. Our job consists in researching and designing monitoring solutions for systems at full-stack level, covering infrastructure, applications and front-end user behavior, adopting both enterprise platforms and open source tools, which is the reason why our path collided with OpenTelemetry and we never left that way. Just a quick summary of what we're going to show you today: we would like to start with a brief introduction about how we handle observability and why we consider it so important, proceeding with a clarification of what OpenTelemetry is. We will mainly focus on distributed traces, one of the three observability pillars, in order to explain how we managed to hack the tool (but be careful, it's not properly only a tool, as we'll see later) and transform the collected data, showing also a little demonstration.

So let's start with a few observability tips. In system theory, observability is a property that measures how well the internal states of a system can be determined and interpreted given its input data and external outputs. It's a system attribute, of course, not an activity or a tool; I could adopt tools to reach observability, for example. It's a kind of mindset rather than a result: the disposition to build a system not only to work, but also to be observed, to be seen. Let's represent my system as an iceberg whose visible part is a kind of black box, where I'm able to understand information only through outputs, through symptoms, explained with the RED metrics: request rate, errors and duration. I have to open the black box in order to understand the causes of those outputs, adopting the USE metrics (utilization, saturation and errors) to perform root cause analysis. Logs are just as important, acting as a diary of the system events plus many additional pieces of information. Traces, instead, are able to correlate user behavior with the last query performed on the DB, introducing a cause-effect concept in a single occurrence, compared to the aggregated values of the metrics. All of them contribute to fully understanding the system behavior, both in fully operational and anomalous situations.

So what is OpenTelemetry, or better, why OpenTelemetry? OpenTelemetry has become the de facto standard for observability. During the last years, many leaders, cloud providers and enterprise observability platform companies contributed to its development. It started as a framework spread across different programming languages, then it was included in the CNCF as an incubating project, where every pillar for which OpenTelemetry defines semantic rules and references has its own maturity level. Only the tracing spec has been released as a stable version, and this is one of the reasons we're mainly focusing on this pillar instead of the others. Finally, new software with purposes other than monitoring started to adopt OpenTelemetry standards to generate valuable telemetry data and provide it to external tools out of the box. These kinds of solutions are really important to understand where the market is going. OTel is vendor agnostic, and that matters because companies are not always so confident about letting agents be installed on their products; on the contrary, they often prevent every kind of external monitoring.
As I said before, OpenTelemetry defined data types, operations and semantic conventions at the beginning, and was officially born as a framework composed of SDKs in different languages to be included in an application's project. Then monitoring capabilities were synthesized in agents able to instrument scripts and virtual machines, e.g. in Java, Node.js, Python, .NET and many other technologies. After a while a Kubernetes operator was created in order to instrument pods with agents in microservices clusters. So for supported technologies we currently have the chance to obtain data out of the box without touching any line of code. Last but not least, the OpenTelemetry Collector is one of the most important artifacts in the project. It is a kind of proxy able to receive OpenTelemetry data in different formats, edit and/or filter it, and send it in different formats to the desired backend. Taking a look at the data flow, it's clear how fundamental it is, in environments with heterogeneous data sources, to translate the information into the expected format and convey it to targets using different protocols, compressions and serializations. The most important thing to understand from here on out is that the OpenTelemetry Collector is the tool we hacked to fully handle telemetry data. So let's go deeper into the distributed tracing world. Alberto, tell us your point of view.

Thank you Andrea. Let's start with the definition of the three fundamental concepts in distributed tracing. A trace is the whole request record that follows the request path top down through the architecture of a system. A trace is composed of spans, which are like the unit of measure in this field. Each span contains all the details of an operation performed within a service, allowing us to know how much time that step has taken. But how can we correlate all the spans that are related to a single request? The Trace Context specification defined by the W3C in 2021 allows us to propagate span data with the traceparent and tracestate headers. I'd like to point out three more important definitions that I will use throughout the webinar. The trace id identifies the whole distributed trace. When a span is created, a unique id is generated for it, and that is the span id. As the request propagates through services, the caller service passes its own id to the next service as the parent span id. Trace id and parent span id are then included in the traceparent header.

Let's look together at an example of a distributed application: this is Sock Shop, an open source demo application. It has the usual website architecture: a front end and various other microservices in the backend, with two databases and a queue manager. The orders, shipping, queue-master and carts services are auto-instrumented using the OpenTelemetry operator. When a request is received by the backend, a trace is generated; as the request propagates through the components, spans are added to the tree. All this data is sent to the collector, which eventually processes it and then forwards it to dedicated storage backends. Let's see how this works. We open the Sock Shop web page, choose an article and then add it to the cart. Now we are opening the cart and proceeding with the checkout. Let's see what this last click generated in our observability tracing backend. Each span represents a step in the request timeline. We can reconstruct the path followed by the request: for example, orders calls the carts service and then the shipping service.
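To make the propagation mechanism just described a bit more concrete, here is a minimal sketch in Python of how a traceparent header could be parsed and forwarded downstream. The four dash-separated fields (version, trace id, parent span id, flags) follow the W3C Trace Context format, while the helper names and example values are ours for illustration and are not part of the talk's demo code.

```python
import secrets

def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,              # identifies the whole distributed trace
        "parent_span_id": parent_span_id,  # span id of the calling service
        "flags": flags,
    }

def child_traceparent(incoming: str) -> str:
    """Build the traceparent an instrumented service would forward downstream."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars, this hop's own span id
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

# A request arrives with this header (example values from the W3C spec)...
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
# ...and the outgoing call keeps the same trace id but carries a fresh span id.
print(child_traceparent(incoming))
```

Run against the example header, the output keeps the same trace id while the span id changes: that shared trace id is exactly what lets a backend stitch the spans from different services into a single tree.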
I'm showing you two tools to highlight the fact that both an enterprise offering like Dynatrace and an open source one like Grafana show traces in the same way, thanks to the standard data format. Okay, all nice and clear, but what happens when we do not have the possibility to export tracing data? Can we make the system observable in some way? Fortunately, the answer is yes. Let's see how. In many situations we encounter applications that are not instrumentable due to restrictions imposed by the software providers. They just send out logs that are then stored in a database, eventually passing through a telemetry processing layer that decouples the application and the backend. Unfortunately, in this way we only deliver data for the second observability pillar, logs. For the sake of simplicity, we leave out the metrics pillar and focus on traces from now on. When application logs contain tracing data, that is, as said, trace id, span id and/or parent span id, this is all we need to correlate them and transform a log into a trace span. We can leverage the processing layer: we use the OTel Collector for that. Now let's see how it works and how it can be manipulated to reach our goal.

The collector is composed of three main modules: receivers, processors and exporters. Receivers accept telemetry data over different protocols. Processors allow data filtering, modification or transformation. Exporters send the elaborated data to endpoints and storage. There are two collector distributions, the core and the contrib; the latter includes the core modules plus all the additional modules developed by contributors for their own purposes. As you can see, collector components are defined in a YAML config file, listed in their own sections, with the chance to customize them with many parameters. The highlighted section is the service section, where pipelines can be configured and each component is enabled only in the step it is made for. We have to pay attention to the pipelines in the OTel Collector releases because they are fundamental to understand what comes next: they arise from the division into pillars and they are really independent from each other. A received trace can only be handled as a trace and sent onwards as a trace, and the same holds for metrics and logs at every step. What we found out breaks this model. We focused our attention on the logs pipeline and detected a point in which information could be manipulated to switch from a structure dedicated to logs to another structure related to a different pillar; in this case, we were interested in the trace structure. We built an exporter able to retrieve the trace id and span id values contained in the log and represent them as key values for spans before sending them to a target able to represent distributed traces. With an application able to produce only logs, as long as they are meaningful logs carrying tracing ids, we have in the end all we need to create correlated spans and thus a distributed trace. The translation takes place in the collector. It leads to a well-formatted trace span that becomes a fundamental piece across the end-to-end path, where before the trace could have been broken. The modified exporter then forwards the log to the log storage, in our case Loki, and the trace to the tracing backends, in our case Grafana Tempo and Dynatrace. Let's now see how this works with a demo.
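As a rough mental model of the translation the modified exporter performs (the real implementation lives inside the Collector and works on its internal data model, not in Python), here is a hypothetical sketch that takes a parsed log record carrying tracing ids and reshapes it into a span-like structure. The field names `trace_id`, `span_id`, `parent_span_id` and the timestamp handling are illustrative assumptions, not the actual exporter code.

```python
from datetime import datetime, timezone

def log_to_span(log: dict) -> dict:
    """Reshape a log record that carries tracing ids into a span-like structure.

    Conceptual sketch only: the real exporter does this inside the collector,
    before sending the result to a tracing backend.
    """
    ts = log.get("timestamp") or datetime.now(timezone.utc).isoformat()
    return {
        "traceId": log["trace_id"],                      # correlates this span with the whole trace
        "spanId": log["span_id"],                        # unique id of this step
        "parentSpanId": log.get("parent_span_id", ""),   # caller's span id, if present
        "name": log.get("operation", "log-derived-span"),
        "startTime": ts,
        "endTime": ts,                                   # a log carries a single instant, so start == end
        "attributes": {"log.body": log.get("message", "")},
    }

# Example: a log line that already contains tracing ids ...
log_record = {
    "timestamp": "2023-06-01T10:15:00Z",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "message": "order 42 shipped",
    "operation": "shipping",
}
# ... becomes a span that a tracing backend can place in the request tree.
print(log_to_span(log_record))
```

Because a log line only carries a single timestamp, the resulting span has zero duration in this sketch; the point is the correlation through the ids, not the timing.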
We compiled our modified version of the OpenTelemetry Collector and now we run the executable in order to have the service listening on a local port. We created a Python script that simulates an application sending event logs. It sends logs in the syslog format to the local port of the collector that is running locally. We decided to focus on modifying the exporter because, at the chain level, it is the last point where data can be modified before transmission. The collector takes each log and transforms it into a trace, then sends the original log to Loki and the trace to both Grafana Tempo and Dynatrace.

Let's open our tools to see how this can be visualized. In Grafana we have the possibility to see logs and traces in the same dashboard. With a Loki query we can find our generated logs, pretty-printed in JSON for better visualization. This highlights the switch from the logs pillar to the traces pillar. From here we can directly find the corresponding trace, thanks to the integration that queries Grafana Tempo looking for the trace id. The visualization is pretty straightforward. The same can be seen in the other tool, Dynatrace: we navigate into distributed traces and visualize the trace generated by our Python script. Today you saw how to manipulate and transform telemetry data for your purposes. When complex distributed systems handle data in different formats, OpenTelemetry allows you to define pipelines and move data wherever you prefer, becoming agnostic from every type of source. Thank you for watching. For any question about this session or our offer, feel free to reach us through socials or via email.
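A minimal sketch of what an event-log generator like the one used in the demo could look like: it emits syslog-style lines that embed tracing ids and sends them to the collector running locally. The port number, the RFC 5424-like layout and the key=value tracing fields are assumptions for illustration, not the actual demo script.

```python
import secrets
import socket
from datetime import datetime, timezone

COLLECTOR_ADDR = ("127.0.0.1", 54526)  # assumed local syslog endpoint of the collector

def send_event(message: str, trace_id: str, span_id: str, parent_span_id: str = "") -> None:
    """Send one syslog-style line carrying tracing ids to the collector over UDP."""
    ts = datetime.now(timezone.utc).isoformat()
    line = (
        f"<14>1 {ts} demo-host demo-app - - - "
        f"trace_id={trace_id} span_id={span_id} parent_span_id={parent_span_id} {message}"
    )
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), COLLECTOR_ADDR)

# Simulate one request crossing two "services": same trace id, chained span ids.
trace_id = secrets.token_hex(16)
front_span = secrets.token_hex(8)
back_span = secrets.token_hex(8)
send_event("checkout requested", trace_id, front_span)
send_event("order stored", trace_id, back_span, parent_span_id=front_span)
```

Both lines share the trace id while the second one points back to the first through its parent span id, which is what allows the collector-side translation to rebuild them as a small distributed trace.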
...

Alberto Gastaldello

Observability & Reliability Engineer @ Liquid Reply IT

Alberto Gastaldello's LinkedIn account

Andrea Caretta

Senior Consultant @ Liquid Reply IT

Andrea Caretta's LinkedIn account


