Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody.
Thank you so much for being here, and thank you, Conf42, for accepting my proposal.
The title of my talk is Proactive Cost Management: Detecting Anomalies in Logs with Time Series Analysis.
In today's cloud native world, managing costs proactively is more crucial than ever.
We will explore how time series analysis, particularly applied to logs, can be useful in this regard.
Nice to meet you. Here is a little about me.
I am Jordan Nino.
I am a cloud application engineer at Google.
My passions lie in SRE, engineering at scale, and observability, of course, as well as sharing knowledge, drawing, reading, writing, and teaching.
You can connect with me on LinkedIn, Medium, or X at jino.
My personal web page is www.jino.
How many of you have been in this situation?
Take one moment to think about this.
This meme perfectly captures a common scenario where DevOps teams are asking for more budget from their finance team.
That is very common.
This is a consequence of misconfiguring cloud computing needs.
Let me tell you a sad, but unfortunately very common, story.
This starts with me. Indeed, she is practically similar to me: an on-call engineer, constantly battling production issues.
Last week, during an incident, I realized a critical flaw: there were no logs to investigate the root cause of the incident.
You can imagine my reaction at that moment, after having to say that it would be impossible to investigate the incident.
I decided to activate all available logs, activity, data access, and system events, because that wasn't going to happen to me again.
Okay.
And here all was okay, but I got an unexpected consequence: a 700% increase in the billing, all due to activating all those logs.
This story is a perfect example of this quote highlighted by Dr. Mad Doric: the hidden cost of relying on anomaly detection and response systems alone. While detecting anomalies is crucial, a reactive approach can lead to significant financial repercussions that are not immediately apparent.
So what is the true cost of inaction?
As you can see from this graph, a lack of proactive cost management can lead to increases in forecasted costs, like this example showing a 700% increase from April to May. Beyond direct expenses, inaction results in downtime, inefficient resource use, security breaches, and wasted engineering time.
The first one is related to lost revenue, customer churn, and reputational damage. Inefficient resource use is associated with cloud bills exploding because of wasted infrastructure. Security breaches can mean massive financial penalties, legal costs, and irreparable harm. And finally, wasted engineering time refers to hours spent troubleshooting reactive problems.
Similar to the sad story I told you. With this context, here is what I will cover today.
I will start by discussing cost management challenges, then explore how logs can be a valuable asset despite their complexities.
I will then dive into machine learning techniques, just an introduction to them at first, of course, specifically focusing on time series analysis, and I am going to explore practical use cases before opening the floor for your questions.
Implementing proactive cloud cost management involves continuously monitoring, analyzing, and optimizing spending on the cloud. It is about preventing costs instead of reacting to them.
The central premise here is that if logs are part of the problem, they
also hold the key to the solution.
In the context of cloud computing, which is a challenging environment, proactive cost management refers to a strategic approach to managing and optimizing cloud spending before it leads to unexpected overruns.
As it is mentioned here, it is a critical component to maximize business value from cloud investments while keeping costs under control.
Here are more benefits of proactive cost management in the cloud context.
An important benefit is anticipating and preventing issues instead of waiting for cost spikes or resource waste to occur.
Other benefits include continuous optimization, predictive analytics, increased elasticity, choosing the right pricing models, and visibility and monitoring.
The first one, continuous optimization, is an ongoing process of refining and improving cloud deployments to maximize resource utilization and achieve business outcomes at the lowest possible price. Predictive analytics is related to ensuring that cloud resources, for example virtual machines, storage, and databases, are properly sized for the actual workloads. Increased elasticity provides a scaling mechanism to match resource allocation with demand, so you only pay for what you use. Then there is choosing the right pricing models and matching storage classes, and, as I mentioned, visibility and monitoring.
That is the focus of this talk.
However, analyzing logs in the cloud presents unique challenges compared to traditional environments, like an on-premise environment, for example.
This is a consequence of the distributed, dynamic, and often ephemeral nature of cloud infrastructure, which can make log collection, analysis, and storage significantly more complex and costly.
And so we circle back to our core idea.
If the problem originates from logs, the solution too should be found within the logs themselves.
The challenge is how to effectively extract the solution.
With these challenges in mind, the big question remains: what is the solution?
How can we turn log data into actionable insights for proactive cost management?
A solution lies in anomaly detection.
That is a solution, not the only one, because there are other options in the state of the art, but in this case this powerful approach combines sophisticated machine learning techniques with statistical methodologies, for example, identifying significant departures from past data or from pre-established criteria to pinpoint anomalies.
This allows us to move beyond simple thresholds and detect subtle deviations.
So, what is anomaly detection?
It is about identifying patterns that significantly deviate from expected behavior, finding the abnormal within the normal. It is vital for logs because it shifts us from reactive firefighting to proactive prevention and gives us an early warning system to catch issues before they escalate.
By using algorithms that can recognize patterns or anomalies in big data sets, machine learning provides a more advanced method of anomaly detection.
The following are important machine learning methods that are frequently applied in SRE for anomaly detection.
I am going to start with unsupervised learning algorithms.
In this case, we don't require labeled data, and considering that we don't have labeled classes in cloud logs, this approach is well suited for anomaly detection applications like this case.
Techniques like clustering algorithms and autoencoders fall into this category.
The first ones, clustering algorithms, group data together based on similarity and recognize outliers as possible anomalies. And autoencoders are models based on neural networks that learn to reconstruct their input, so data that reconstructs poorly can be flagged as anomalous.
Supervised learning algorithms, another type of machine learning technique, on the other hand, require historical data with labeled anomalies to train the model.
This picture that I made, I think, illustrates this concept. Classification algorithms include support vector machines and random forests.
And semi-supervised learning algorithms combine elements of supervised and unsupervised learning: anomalies can be identified by a model which has been trained only on normal data.
Isolation forest is a tree-based technique that separates data into subsets in order to isolate the anomalies.
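To make one of these unsupervised options concrete, here is a minimal sketch of what it could look like in BigQuery ML, the service I will use later in the demo, using a k-means model and ML.DETECT_ANOMALIES. The project, dataset, table, and column names are hypothetical placeholders, not something from this talk's demo.

```sql
-- Hypothetical sketch: unsupervised anomaly detection with a k-means model.
-- Points that sit far from every cluster centroid are flagged as anomalies.
CREATE OR REPLACE MODEL `my_project.my_dataset.kmeans_model`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 4
) AS
SELECT feature_1, feature_2
FROM `my_project.my_dataset.metrics`;

-- contamination is the expected fraction of anomalous rows in the input.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.my_dataset.kmeans_model`,
  STRUCT(0.02 AS contamination),
  (SELECT feature_1, feature_2 FROM `my_project.my_dataset.metrics`)
);
```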
Since many SRE metrics have a temporal component, it is really challenging to use these techniques directly.
So time series analysis techniques are essential for identifying abnormalities over time. In seasonal decomposition, to find anomalies, the time series is broken down into seasonal, trend, and residual components.
I will review more details in the next slide.
Let me now talk about how Google manages time series, because Google Cloud offers some services that provide strategies for solving issues related to time series.
When dealing with time series, it is important to understand that most are not stationary, meaning the data's statistical properties change over time.
I know this could be confusing since we are not experts in machine learning techniques, but I think it is a good introduction to the topic, and the most important concept for solving the issue related to proactive cost management.
For instance, financial time series often exhibit random walk with drift behavior. Similarly, energy production is hugely influenced by factors like wind and solar supply, leading to dynamic patterns.
For solving that, the sad story that I told you about, I chose ARIMA. ARIMA is an acronym.
AR stands for autoregression: a model that uses the dependent relationship between an observation and some number of lagged observations.
I stands for integrated, the middle letter: it means the use of differencing of raw observations in order to make the time series stationary.
And MA, the last letters, stands for moving average: a model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
As I mentioned, these time series models are available in Google Cloud, particularly through BigQuery ML. BigQuery ML allows you to create and execute machine learning models using standard SQL queries.
I am going to show a demo in which you can see this service and this feature, including ARIMA, of course.
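Before the demo, here is a minimal sketch of what creating such a time series model looks like in BigQuery ML. The dataset, model, table, and column names are hypothetical placeholders, not the exact query from the demo; BigQuery ML's ARIMA_PLUS model type takes care of differencing, order selection, and seasonality for you.

```sql
-- Hypothetical sketch: an ARIMA-based time series model in BigQuery ML.
-- minute_ts is the timestamp column and log_count the value to model.
CREATE OR REPLACE MODEL `my_project.my_dataset.log_volume_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'minute_ts',
  time_series_data_col = 'log_count'
) AS
SELECT minute_ts, log_count
FROM `my_project.my_dataset.log_counts_per_minute`;
```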
Now let me explore some real world examples and use cases where anomaly detection with time series analysis can provide significant value.
By using algorithms that can recognize complex patterns, machine learning provides a more advanced method of anomaly identification, and these methods are frequently applied in SRE for anomaly detection.
Time series analysis and anomaly detection have broad applications across various industries, but I chose these use cases because they are very challenging and they are at the state of the art of the market.
In retail and e-commerce, they are used for sales and demand forecasting and churn rate prediction; challenges include forecasting new products and complex product hierarchies.
Financial services leverage them for asset management and product sales forecasting, despite challenges like noisy data and partially observable Markov decision processes.
In manufacturing, use cases include predictive maintenance and adaptive controls, often facing poor data quality and diverse sensor types.
And finally, healthcare, where anomaly detection is used for bed and emergency occupancy and drug demand forecasting, with data privacy and disparate sources being key challenges.
This is the architecture of my solution. The input is basically the logs, which are sent to BigQuery using a log router for the time series analysis.
Remember that the proper solution in Google Cloud to analyze the logs using time series is BigQuery ML.
Since third party solutions offer a better visualization of results, I export the results of the time series analysis to a CSV file, which is afterwards imported into Google Sheets with the aim of visualizing the results.
As you can see here.
Let me now go to the Google Cloud console and show you how you can detect an anomaly in Cloud Logging using BigQuery and time series analysis.
To do this, we are going to create a log sink in Cloud Logging using the Log Router option.
In this form, we enter the name and a description for the sink.
We select a destination, which in this case can be a log bucket in Cloud Logging, a dataset in BigQuery, a bucket in Cloud Storage, a topic in Pub/Sub, or a separate Google Cloud project.
For our exercise, we select the BigQuery dataset option, which enables us to create or select the dataset.
Here, we can create the dataset in the local project or in a new one, or select one that already exists, which is just what we are going to do in this case.
We add a filter to define the logs that we are going to send to BigQuery; in this case, we leave this option blank.
And as an optional step, we can create an exclusion to determine which log records will not be sent to the dataset.
With this information, we click on Create Sink and verify its creation.
To do this, we go to the BigQuery service and verify that the table exists.
Indeed, the table exists.
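If you prefer checking from SQL instead of the console, a query like the following, with a hypothetical project and dataset name, would list the tables created by the sink:

```sql
-- Hypothetical sketch: list the tables in the dataset that receives the sink's logs.
SELECT table_name
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`
ORDER BY table_name;
```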
Now it is time to create the model and detect anomalies in BigQuery.
Considering that the log table has several fields with null values that are not relevant for anomaly detection,
we create the model on a subset of the data using this query.
As evidenced by the query, we are going to use the timestamp of the logs, although we truncate it to minutes because that is the highest level of granularity that time series based anomaly detection supports. To have one sample per unit of time, we group the logs by date and, in this case, transform the values to integers to facilitate estimation.
With this grouping, we can obtain the associated counts, which we will use in the model configuration.
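As a rough sketch of the kind of aggregation just described, assuming the sink writes to a hypothetical table with a timestamp column, the per-minute counts could be produced like this and then fed into the CREATE MODEL statement shown earlier; the real demo query also leaves out the irrelevant fields mentioned above.

```sql
-- Hypothetical sketch: one sample per minute of log volume.
SELECT
  TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute_ts,
  COUNT(*) AS log_count
FROM `my_project.my_dataset.my_log_table`
GROUP BY minute_ts
ORDER BY minute_ts;
```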
With that done, let's use another query to detect the anomalies.
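For an ARIMA_PLUS model, that second query can be a call to ML.DETECT_ANOMALIES. This is a sketch with hypothetical names and a probability threshold you would tune:

```sql
-- Hypothetical sketch: flag timestamps whose log volume falls outside the
-- model's prediction interval (is_anomaly = TRUE in the output).
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.my_dataset.log_volume_model`,
  STRUCT(0.95 AS anomaly_prob_threshold)
);
```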
Now we are going to download the results to a local file and import them into Google Sheets to have more visualization options.
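Downloading from the console works fine for a demo; as an alternative, a scripted export of the same results to CSV could look like the following sketch, where the bucket and model names are hypothetical:

```sql
-- Hypothetical sketch: export the anomaly results to CSV in Cloud Storage,
-- ready to be imported into Google Sheets.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/log-anomalies-*.csv',
  format = 'CSV',
  header = TRUE,
  overwrite = TRUE
) AS
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.my_dataset.log_volume_model`,
  STRUCT(0.95 AS anomaly_prob_threshold)
);
```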
With this done, I generate a chart in order to visualize possible anomalies.
As you can see, the anomalies are here: there are a lot of logs generated on these days, which matches the behavior in Cloud Billing.
Having identified this pattern, we can predict the next increment in the size of the logs and, in consequence, in the billing.
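One way to turn that prediction into numbers is ML.FORECAST on the same model; again a sketch with hypothetical names, forecasting the next 60 minutes of log volume:

```sql
-- Hypothetical sketch: forecast future log volume, with a confidence interval,
-- so the expected impact on billing can be estimated ahead of time.
SELECT *
FROM ML.FORECAST(
  MODEL `my_project.my_dataset.log_volume_model`,
  STRUCT(60 AS horizon, 0.9 AS confidence_level)
);
```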