Transcript
Deploying infrastructure is always a mix of planning and firefighting.
Sometimes everything works perfectly.
Other times you are troubleshooting at 2:00 AM wondering what went wrong.
Hey guys, my name is Prade Gadi.
I'm a senior DevOps engineer.
My agenda today covers a little bit about my journey and my experiences with incidents, tools, and the SRE world.
Coming from the banking and financial services sector, I started my journey working with legacy systems such as WebSphere, Tomcat, and manual deployments. Custom scripts helped, but they introduced a margin for error that made me appreciate the value of standardized tools like Terraform, Helm, and CI/CD pipelines.
This talk is a collection of real-world SRE failures, lessons learned, and the unexpected challenges that come with managing infrastructure.
So let me go over my experiences in the real world.
Let me start with a data center incident.
Let me tell you this.
Make sure your cloud provider's service terms align with your requirements, especially around RPOs and RTOs.
When we signed up for GCP, we used Google Cloud Storage with dual-region buckets. We assumed this meant our data would be replicated instantly between regions. Then a data center fire hit, and objects in our GCS bucket were unavailable for up to two hours.
Yes.
We later discovered that GCS's default replication only guarantees that 99.9% of newly written objects are replicated within one hour, and 100% within 12 hours. Meaning if a data center disaster happens, there is a chance some of your objects haven't been replicated yet; it can take up to 12 hours for them to reach the other region.
They later recommended enabling Turbo Replication, which guarantees 100% replication to both regions within 15 minutes, regardless of object size. Had we known this beforehand, we would have enabled the feature and avoided the whole situation.
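As a rough sketch, here is how that setting can be flipped with the google-cloud-storage Python client, assuming an existing dual-region bucket; the bucket name is a placeholder and the rpo field is the knob Turbo Replication uses:

```python
# Hedged sketch: enable Turbo Replication on an existing dual-region bucket.
# Assumes the google-cloud-storage client library; bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-dual-region-bucket")  # placeholder name

# "ASYNC_TURBO" targets replication of new writes within 15 minutes;
# "DEFAULT" is the standard behavior (up to 12 hours).
bucket.rpo = "ASYNC_TURBO"
bucket.patch()

print(f"Replication mode for {bucket.name}: {bucket.rpo}")
```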
So what's the lesson here? Always research whether a product meets your SLA needs before you go to production. Read the "conditions apply" details seriously. That's going to save you from incidents of this kind; if such features exist, you can enable them up front so you don't face these issues. So research the product before you sign up.
Let me go over my experience and incidents with respect to certificates. SSL certificate issues are more common than you would expect. A few years ago, public certificate validity dropped from three years to one year, which means you have to renew your certificates every year. Public wildcard certificates especially are often reused across multiple places, which makes tracking them harder.
So how do you track them? If you use commercial products like Datadog, look for features like synthetic monitoring, which you can use to probe the endpoints, get the certificate expiry date, and renew the certificates on time. And if you're using open source such as Grafana, look for the Infinity plugin. That way you can query the endpoint, get the certificate information, and keep track of the certificates.
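If neither of those tools is available, even a small homegrown check works. A minimal Python sketch of the idea, with made-up endpoints and threshold:

```python
# Minimal sketch: report days until the server certificate expires
# for a list of HTTPS endpoints. Endpoints and threshold are placeholders.
import socket
import ssl
from datetime import datetime, timezone

ENDPOINTS = ["example.com", "example.org"]  # placeholder endpoints
WARN_DAYS = 30

def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # "notAfter" looks like "Jun  1 12:00:00 2026 GMT"
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days

for host in ENDPOINTS:
    remaining = days_until_expiry(host)
    flag = "RENEW SOON" if remaining <= WARN_DAYS else "ok"
    print(f"{host}: {remaining} days left [{flag}]")
```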
Also, proper certificate documentation, including the locations where they're installed, is very important. It sounds basic, but certificate-related outages can be seriously embarrassing.
Let me tell you an example. At one of the companies, a financial client, we had an outage because an intermediate certificate was missing.
Let me explain the certificate terminology in simple terms. The server certificate is signed by an intermediate certificate, and the intermediate is signed by a trusted root. If the chain isn't complete, visitors may see an incomplete-chain error.
So how do you overcome this? The moment you renew any certificate, go to online tools like SSL Labs to verify the complete certificate chain and make sure you have your root, intermediate, and server certificates in place. Also, if you have the OpenSSL utility on your command line, just hit the endpoint with s_client and use the option called -showcerts. It's going to show you the complete certificate chain. Also, look for "Verify return code: 0 (ok)" at the bottom. That tells you your certificate is good and the chain is complete.
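To make that check repeatable after every renewal, the same openssl command can be wrapped in a small script. A minimal sketch, assuming the openssl CLI is installed and using a placeholder hostname:

```python
# Minimal sketch: run "openssl s_client -showcerts" against an endpoint and
# confirm the chain verifies. Assumes the openssl CLI is on the PATH.
import subprocess

def chain_is_complete(host: str, port: int = 443) -> bool:
    proc = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-showcerts"],
        input="", capture_output=True, text=True, timeout=30,
    )
    # openssl prints every certificate in the chain plus a verification
    # summary; "Verify return code: 0 (ok)" means the chain is complete.
    return "Verify return code: 0 (ok)" in proc.stdout

print(chain_is_complete("example.com"))  # placeholder endpoint
```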
Let me talk about my experience with logging and monitoring. As an SRE, logs are your lifeline during an incident, right? So if an incident happens, you always look for logs and metrics.
So ensure that all systems have proper logging enabled before you even promote them to production, and make sure those logs are actually forwarded to a centralized logging platform. This should be part of your pre-deployment checklist.
So let me describe a recent incident. We had this incident where some of the records were missing from a database, and that database wasn't configured properly to emit logs, so there weren't many logs in that database. Also, the virtual machines connecting to that database had logs, but they weren't forwarded to a central logging system. So we had to look for logs in multiple places, which delayed our investigation.
So it's always a good idea to, first of all, have proper logging in place, and second, forward those logs to a centralized system like Datadog or Elasticsearch. If you don't want to use commercial products due to budget constraints, you can always set up open source like Elasticsearch.
Likewise, monitoring plays an important role. Let me give you a simple incident that happened at my previous company. I set up the Elasticsearch logging system, but I couldn't set up disk monitoring due to various reasons and time constraints.
It sounds simple, right? Being an SRE, how could I miss that? But sometimes it's all about time and priorities. So I missed setting up the disk alert. Moreover, Elasticsearch wouldn't ingest logs once the disk reached 80% due to the watermark setup.
So it's always better to set up these alerts the moment you install the software or provision the infrastructure. The best way is to use configuration management tools like Ansible or Puppet to ensure these monitoring and logging agents are always installed as part of your infrastructure provisioning.
We all learn from our mistakes, don't we? So the takeaway here is: always have a centralized logging and monitoring system; that way it's easier to track issues and fix them quickly.
Let me briefly cover my experience with Terraform. I didn't have any real incidents with Terraform, but I observed things that could be improved at a workplace by following a few recommendations. If possible, use public Terraform modules provided by cloud vendors unless your use case is truly unique. I spent a lot of time writing custom modules that I later replaced with standardized, well-maintained modules from the Terraform Registry. It saved time and reduced maintenance.
For example, let's say you want to spin up a GCP VM. There are a couple of options. You could write all the modules and resources yourself, or, if it's a simple use case, just use the preexisting public modules from the Terraform Registry.
My second recommendation would be, if possible, to integrate your Terraform with CI/CD. In my previous workplace, we were all running Terraform from our workstations. You know that there is a chance you provision infrastructure that doesn't have any approvals from your teammates. It's always better to have approval from your teammates when you provision infrastructure or run any code. So having Terraform plan and apply steps in your CI/CD enables your teammates to review them before the infrastructure is provisioned.
You could also use security scanning tools to scan your Terraform code. They catch misconfigurations, exposed secrets, and compliance issues before they hit production.
The takeaway here is: reuse what's reliable, shift Terraform into your CI/CD workflow, and expect the unexpected.
See, SRE isn't just about tools. It's about resilience, adaptability, and curiosity. Sometimes you are solving infra issues. Other times you are wearing four hats at once. But every incident, every mistake, and every fix is a chance to grow.
In my last company, I set up the whole monitoring system and thought I was done. But no, the data being ingested into Prometheus was coming in different formats. Some data was coming in Protobuf format, so I had to write Python code and deploy it in a cloud function to convert that Protobuf data into a format that Prometheus understands.
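The exact Protobuf schema depends on the source system, so this is only a hedged sketch of the second half of that conversion: once the payload is decoded, re-emit it in the Prometheus text exposition format that a scrape-style endpoint can serve. Metric names and labels here are invented.

```python
# Hedged sketch: render already-decoded samples in the Prometheus text
# exposition format. Decoding the incoming Protobuf payload is left out
# because it depends on the source system's generated classes.
def to_prometheus_text(samples):
    """samples: iterable of (metric_name, labels_dict, value) tuples."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

# Invented example: two decoded samples rendered for a /metrics-style endpoint.
print(to_prometheus_text([
    ("queue_depth", {"env": "prod", "service": "payments"}, 42),
    ("queue_depth", {"env": "prod", "service": "ledger"}, 7),
]))
```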
So it was an unexpected crash course across languages, formats, and platforms. That's what SRE is all about.
Another time, we had chosen Elastic for monitoring and logging, assuming the free tier would be enough. Turns out Slack alerts were locked behind a paywall. Basically, we had to upgrade to the premium version if we wanted to send any alerts to Slack, so we ended up writing our own Python scripts to send those alerts ourselves.
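A minimal sketch of that workaround, assuming a Slack incoming webhook; the URL below is a placeholder you create in your own workspace:

```python
# Minimal sketch: post an alert message to a Slack incoming webhook instead
# of relying on the paid alerting integration. The webhook URL is a placeholder.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_slack_alert(message: str) -> None:
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()  # Slack responds with "ok" on success

send_slack_alert(":warning: disk usage above 80% on the logging node")
```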
So my point here is: always pilot tools in dev environments before moving to production. If you understand your requirements better before you move to production, you'll avoid a lot of manual work, and you'll know whether that product really works for you. So the key takeaway here is: always pilot things.
And be ready to learn new tech, switch hats, and fix things under pressure. You have to be ready for that as an SRE.
Not everything goes according to plan.
Sometimes you are learning, piloting, implementing, and fixing all in one day.
It's intense, but that's where the real growth happens.
Thank you very much for listening to my talk, and thank you very much for this opportunity.