Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, good evening.
My name is malicious.
I'm a senior workplace engineer.
Today I'll be speaking on "OpenTelemetry is not a Silver Bullet: lessons learned from the field."
Before I proceed, I would like to talk about the need for OpenTelemetry.
OpenTelemetry is needed in observability because it is used to collect telemetry data, which helps you understand the behavior of your system or your servers, especially in distributed systems.
Telemetry data is very important because it gives you an overview of what your system's performance is like, and it gives you the opportunity to take proactive actions to make sure your systems are running as expected.
Now we'll be speaking on the vision: why OpenTelemetry was created.
The first vision behind it was a unified standard.
What this means is that there have been a lot of promises about getting a single standard for traces, logs, and metrics across distributed architectures, regardless of where your system sits, whether in the cloud, on premises, or outside of the cloud.
Telemetry has been very important in these environments for gathering this data, which gives us insight into what is going on within our environment.
The reason for this was to make sure there is unified monitoring across our environment.
The second vision has to do with vendor freedom.
There have been a lot of promises around lowering risk and around how we standardize the data we collect from these environments.
These promises come from different vendors telling us how we can use this tooling, these systems, or these platforms to achieve this or that.
But at the end of the day, it boils down to what we have in place, what we have in our team, and how we go through those processes to make sure we carry out all the activities we need to keep our systems efficient and performing a hundred percent, without any resource constraints.
Now, on the other hand, we have also been promised growing adoption. There has been a lot of adoption, with clouds adopting different tools and systems, as we can see in cloud offerings like Microsoft Azure, AWS, and Google Cloud.
So there have been a lot of promises around these areas as well.
And if you notice, there are some established platforms; for instance, we have Datadog and we have LogicMonitor.
LogicMonitor is not really a big suite for collecting data, but we can use these collectors to try to collect data and to understand what is going on within the environment.
There have also been a lot of open-source projects around this, which seems promising, and it is therefore very important that organizations adopt these growing tools, systems, and platforms. Rather than going for a platform, it's best to use these tooling systems, as they give us what we need to make sure our environments are running as we expect.
Now, in reality, in the real sense of the whole thing, we are talking about the common pain points we have encountered in our various fields and experiences, be it as a DevOps engineer, platform engineer, software engineer, or an SRE, depending on what your position is.
There are challenges we encounter across these disciplines.
The first thing I would like to speak about is the documentation challenges.
There have been a lot of documentation errors and deficiencies: making use of obsolete documentation that needs to be updated, from vendors or from tooling like this, which could otherwise help us propel our collection of data using our collectors.
These documentation challenges have brought a lot of gaps and redundancies within teams, organizations, and businesses, because no one is checking all of these formal things; people don't really take them seriously, and this has created a lot of issues.
It is therefore very important to make sure your documentation is compliant and is treated as a regular activity rather than something someone picks up once in a while.
So it's very important that everyone ensures the documentation is updated and everyone is carried along, to avoid any issues that could impact your environment.
Another issue we also face is the boilerplate burden.
This has to do with the software development kits we use to write our code, be it Java or Python, depending on what your organization uses and the language your team is using.
It is very important that the software development kits, tools, or IDEs we use integrate with the platform we use, and we also need to ensure we are not using outdated libraries or outdated code that could cause an issue in our environment or even break it.
So it is very important that all of these are checked, and we also need to ensure that every developer and every DevOps engineer in the team is using the standard code methodology to avoid problems.
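To make that boilerplate burden concrete, here is a minimal sketch of manual tracing setup with the OpenTelemetry Python SDK. It assumes the opentelemetry-sdk and OTLP gRPC exporter packages are installed and that a collector is listening on the default local OTLP endpoint; the service and span names are purely illustrative.

```python
# Minimal manual OpenTelemetry tracing setup (sketch, not a full configuration).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service so backends can group its telemetry.
resource = Resource.create({"service.name": "checkout-service"})  # hypothetical name

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order"):
    pass  # application work would happen here
```

Even this small amount of setup has to be repeated, correctly, in every service and every language the team uses, which is exactly where the boilerplate burden shows up.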
Now, lastly, we are going to speak about version conflicts.
When it comes to version conflicts, there have been a lot of incompatibilities, which creates constant friction amongst teams, amongst engineers, and across organizations.
It causes a whole lot of issues, especially when trying to ingest data, collect data, trace, or get logs to check what the system is doing.
We found that a lot of issues come down to version conflicts: incompatibilities, or versions that simply are not compatible with our system.
So it is very important that we make sure all of these are checked, and that the tooling systems and the versions we use to collect data within our environment are compatible with our systems.
Now, moving forward, we'll be speaking about the scaling headaches: the collector and infrastructure challenges.
The first thing I'm going to speak about is throughput.
The OpenTelemetry collector faces performance issues as the environment grows; the processing pipeline can't keep up with the data volume.
This is evident in a situation where you have an environment loaded with a whole lot of data, with a whole lot of storage, and nobody is checking it.
There are no collectors monitoring the environment, no collector monitoring all the storage, and nobody knows when the storage gets filled up.
So it causes a whole lot of problems.
To give an example from the field: we had a server environment that we used for our support server and also for our data warehousing.
We noticed that the cluster storage within this environment had filled up, and because there was no collector monitoring the environment, the storage filled up and the disk for the server also filled up.
This led to the crashing of that environment, and it cost a huge amount of money for both our business and our organization.
So it is essential to make sure there are collectors deployed across the environment, taking real-time performance data, so that the junior SREs looking after the environment are proactively aware of what is going on, and in case of any issues there can be scaling up, or adding of more storage within the environment, to keep the environment up and running.
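As a sketch of what that proactive monitoring could look like, assuming the opentelemetry-sdk, the OTLP gRPC exporter, and the psutil package: expose filesystem usage as an observable gauge so a filling disk, like the one in the incident above, shows up in dashboards and alerts before the environment goes down. The metric name, mount point, and meter name are illustrative choices, not a standard.

```python
# Report filesystem usage as an OpenTelemetry observable gauge (sketch).
import psutil
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

def disk_usage_callback(options):
    # Report percent used for the root filesystem; add other mounts as needed.
    usage = psutil.disk_usage("/")
    yield metrics.Observation(usage.percent, {"mountpoint": "/"})

reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("storage-monitor")  # hypothetical instrumentation name
meter.create_observable_gauge(
    "system.filesystem.usage_percent",
    callbacks=[disk_usage_callback],
    unit="%",
    description="Filesystem usage for key mounts",
)
```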
Now, another one I would like us to speak about is resource hunger.
It's very important to make sure we are not using burstable servers to deploy our collectors.
What I have noticed is that burstable servers are not really meant for demanding environments that consume a whole lot of data, because burstable CPUs constrain the processor and much of the memory and RAM, and this causes a whole lot of issues.
So it is very important to make sure you're using good specs for your environment; burstable CPUs are mainly meant for development and testing environments.
It's important that businesses and organizations understand that they need something higher, for instance an enterprise tier, or something that gives them at least a dedicated CPU or even more, depending on what the business requires.
Burstable CPU servers are not recommended, especially for collectors that monitor huge environments and take in real-time data.
So it's important that organizations scale up, and also that there is good monitoring of this environment.
Another one I'll be speaking about is the data reliability risk.
Without careful architecture, data gets dropped or delayed, and single points of failure emerge in critical telemetry paths.
What this means is that while designing this environment, we must make sure there are no gaps where we can lose data, and we must make sure the tooling we are using for this environment is compatible and appropriate, so that we can get the monitoring, know what is happening in our environment, understand the performance, and also scale up where needed.
This helps keep the environment up and running, and it puts us in a better position, one that will not cost us issues in the future or cost us money.
So it's important that while we are designing our infrastructure or our environment, we make sure all these gaps are covered, to avoid any unforeseen circumstances.
Now I'll be speaking on the gaps between promise and practice.
There have been a whole lot of promises and practices, but the question is: are these promises delivering what we want? Are we practicing what we preach?
I'm going to start with the storage and visualization gaps.
OpenTelemetry defines the data pipeline, but not end-user analytics.
There is a gap between what OpenTelemetry gives us and what we can get out of it.
It is true that OpenTelemetry helps us get real-time data and observability, which gives us a better overview of what is going on in the environment.
But what we lack is how this helps a single user. Are there any dashboards we can use?
This is where we are lacking, because we need to make sure there are dashboards for end users, which they can use to understand what is going on within their own work. So it's very important that all of this is put in check and done properly.
Now, secondly, we need to speak about library polyglot parity.
We are talking about language and software development kit integration quality, which varies widely.
This has to do with the type of languages we are using to ingest and collect our data, whether Java or JavaScript.
So it's very important that we make sure we are using the programming languages that best suit our needs and our business services as well.
Lastly, team education challenges.
Onboarding engineers requires significant investment. This is very important because it is easy to onboard engineers to start working on these tooling systems: collecting data and logs, ingesting them, and trying to triage what is going on in the server.
But it's very important that while we onboard engineers, we provide them with the necessary training that gives them the knowledge to excel in the position, either as an SRE or as a monitoring specialist.
It is very important that engineers are not only onboarded, but also understand why they are triaging a specific environment, what the data is used for, how they can put it into a report and explain to a wider audience what is going on, and how they are working proactively to make sure the data they are getting is being used as expected.
So it is therefore important to invest in training engineers, to make sure they are well equipped and ready to carry out this task.
Here we'll be speaking about case studies and implementation pitfalls.
First of all, we will talk about collector failures, then we talk about version drift, and lastly we talk about the monitoring blind spots.
On collector failures: leading software-as-a-service providers face a whole lot of collector crashes under peak loads, and monitoring gaps occur during critical operations.
This is evident with some software providers who ship a whole lot of software they know is not really compatible with the environment, or where they know that what you are using within your environment could negatively impact the software.
It is therefore important to make sure you're using the best software, software that works as expected and takes in the data in real time, so we can analyze it and make informed decisions.
This matters because when a collector fails, it can lead to bigger issues, such as bringing the server down or affecting other services that are very important or critical to your business.
So it's imperative that we make sure we are investing in the right software to collect monitoring data within our environment.
The second one we'll be speaking about is version drift.
It is true that weeks can be spent triaging issues where, say, a library update didn't match the company code base.
This is where we need to ensure there are no compatibility issues within our environment, depending on the type of version we are using for getting our logs.
We need to make sure the version we are using matches the vendor's requirements or the tooling system's requirements.
It is very important that we are not using something that is not meant for the software, because it might work at the first instance, but it won't give you the right result that would help you succeed in triaging the required logs you need to proactively make your environment better.
Now, lastly, the monitoring blind spots. A lot of teams go in, just set up the monitoring, and then go to bed, or they just feel like, oh, we have done this, it is monitored.
What happens is that most of these engineers or companies follow an out-of-the-box way of monitoring server environments without taking the necessary actions, which have to do with testing it: testing whether the monitoring is working, checking the dashboards, triaging, and taking logs in real time to understand what is going on, just to make sure everything works as expected before saying yes, we have put this monitoring in place.
I think it is very important to carry out some sort of testing before closing off and saying, yeah, we are done with this.
Because what happens most of the time is that engineers are told to go and set up the monitoring, and then they just leave it; they don't even try checking the methods they are using, whether the schemas are updated, whether they need to bring in a new version, a new code version, or something else that would help make sure their monitoring is working as expected.
So it's very important to make sure there is some sort of testing and confirmation before handing off to other tasks.
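Here is a sketch of one way to "test the monitoring" before sign-off, assuming pytest and the opentelemetry-sdk in-memory exporter: capture spans locally and assert that the instrumentation actually emits what the dashboards expect. The span and attribute names are illustrative.

```python
# Verify instrumentation emits the expected span before handing off (sketch).
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_order_processing_emits_span():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))

    tracer = provider.get_tracer("checkout-tests")  # hypothetical instrumentation name
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", "test-123")

    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    assert spans[0].name == "process-order"
    assert spans[0].attributes["order.id"] == "test-123"
```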
So, the lessons I have learned, and what really works from my own understanding. First, I would say start small.
Why I would say start small is that it is very important to target critical systems and limit the initial scope before rolling out widely.
Two is custom documentation. Investing in internal training and documentation tailored to your specific tech stack, depending on the tools you're using to collect your data, is very important.
Make sure you understand the tooling, make sure everybody is involved, and make sure the knowledge base is always kept up to date with the latest information, the latest steps, and the latest processes, so that everybody is part of this as a culture, not just as a team: something that is passed on from the first line all the way up to the highest person.
So it is important to put internal documentation in place, not just for the technical team but for the whole range, so that anyone reading that documentation understands what is being done and what is being said.
Now, another thing I would like to speak about is incremental expansion.
I understand that sometimes engineers might want to add a whole lot of servers at once, maybe to speed up their work or to make sure the monitoring is going as they expected.
I think it's very important that systems are not added all at once but gradually, as the team grows and as everyone gets familiar with what he or she needs to do.
It is also very important to add a system only based on the capacity of your resource environment.
And lastly is the regular review.
This is where disciplined monitoring of schemas, resources, and integration points comes in.
It is very important that we regularly review our monitoring.
We set out time, and it could be allocated to different engineers at specific times, just to know what is being monitored and to make sure that the data and the performance thresholds are all being kept as the team expects.
So it is very important that these regular checks are carried out on a schedule rather than once in a while; it could be done maybe once a week or twice a week, depending on the availability of the engineers.
It is very important that this is carried out and that everyone is carried along while this is being done as well.
Now, these are the realistic expectations and practical recommendations.
First: toolkit, not platform.
I think it is very important that we understand that OpenTelemetry provides powerful components, not a finished observability solution.
So it is important that we focus and invest more on the tooling rather than on the platform.
What I mean by "rather than on the platform" is this: you get told, oh, we need to buy Microsoft Azure, you need to buy an AWS subscription, depending on the needs; oh, we have different types of monitoring tools that can help you scale your business and get what you want or get your data.
I don't think that is the best way to go about it, because at the end of the day you end up spending more money on different resources.
So I think it's best to go for one tool that is solely built for collecting traces and gathering metrics from your telemetry data, maybe such as Datadog; this is just an example.
So I think it's better that organizations and teams go for specific tools rather than going for a platform.
Secondly, ongoing investment.
I would like to read what this headline says: success requires engineering resources, organizational alignment, and an understanding of system risk.
How I would explain this is that success requires resources, which means that everyone needs to be trained.
Engineers need to be trained; they need to have the right resources and the right knowledge, so that there is no knowledge gap.
This also has to align with the organization: what the organization wants to get from these engineers and what the expectations are, and at the same time prioritizing the critical systems.
The knowledge base and the tooling system need to be prioritized within the critical systems, to avoid any risk that could cost the business or the services.
And while organizations are investing in the engineers, it is important to make sure they are also investing in the infrastructure, that there is always an infrastructure review, and that there is always an upgrade: following up with vendors, making sure the server systems run on the latest version.
I think it's important to put investment into this, because this would lead to fruitful success both for the business and for the engineers, and the systems will be happy if the right resources are being provided.
The last one will be: build complete solutions.
As much as there are a whole lot of benefits we can get out of OpenTelemetry, I think at the same time it's important to build robust alerting and visualization layers around OpenTelemetry, because OpenTelemetry is just a foundation. If you have robust alerting and visualization layers and other tooling that can help make the work easier, I think it's very important that these are considered, because this helps us actually get the results we need.
So I would say that businesses, users, and engineers need to put their attention on this rather than just using one tool system that does everything.
I think it would make sense if they spread their hands to other tooling systems, adding to what they have, as a full, complete solution.
Conclusively, I would say that OpenTelemetry unlocks control of observability without eliminating complexity, and this is where the flexibility comes in.
Also, teams must treat observability as ongoing work and not a one-time setup, because there needs to be continuous engineering across teams, across the business, across the board, across the managers and the stakeholders; all of them need to be involved, because they need to be told, or be aware of, what is going on in the environment.
So I think OpenTelemetry offers us good value, and all of this needs to be put in place.
And lastly, the pragmatic benefits.
This is where a pragmatic, practical approach comes in: actually realizing real-world value beyond the hype, and making sure everything that needs to be done is being done.
So I think it's very important that we make sure there is accurate technical documentation, that engineers are being trained, and that there is good investment in the tooling systems we use to collect our data and to proactively keep our environment in a good state.
So I would say that OpenTelemetry is a great asset to use; it brings a whole lot of benefits and a whole lot of true value to businesses.
So businesses need to invest not only in platforms, but especially in the tooling systems that work best for their services.
Thank you.
It was nice speaking to you all.
Bye.