Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody.
Thank you so much for being here, and thank you, Conf42, for accepting my proposal.
The title of my talk is Proactive Cost Management: Detecting Anomalies in Logs with Time Series Analysis.
In today's cloud native world, managing costs proactively is more crucial than ever.
We will explore how time series analysis, particularly applied to logs, can be useful in this regard.
Nice to meet you. Here is a little about me.
I am Jordan Nino.
I am a cloud application engineer at Google.
My passions lie in SRE, engineering at scale, and observability, of course, as well as sharing knowledge, drawing, reading, writing, and teaching.
You can connect with me on LinkedIn, Medium, or X at jino.
My personal web page is www.jino.
How many of you have been in this situation?
Take one moment to think about this.
This meme perfectly captures a common scenario where DevOps teams are asking for more budget from their finance team.
That is very common.
This is a consequence of misconfiguring cloud computing needs.
Let me tell you a sad, but unfortunately very common, story.
This starts with me. Indeed, she is practically similar to me: an on-call engineer, constantly battling production issues.
Last week, during an incident, I realized a critical flaw: there were no logs to investigate the root cause of the incident.
You can imagine my reaction at that moment, after having to say that it would be impossible to investigate the incident.
I decided to activate all available logs, activity, data access, and system events, because that wasn't going to happen to me again.
Okay.
And here all was okay, but I got an unexpected consequence: a 700% increase in the billing, all due to activating all those logs.
This story is a perfect example of this quote highlighted by Dr. Mad Doric: the hidden cost of relying on anomaly detection and response systems alone. While detecting anomalies is crucial, a reactive approach can lead to significant financial repercussions that are not immediately apparent.
So what is the true cost of inaction?
As you can see from this graph, a lack of proactive cost management can lead to increases in forecasted costs, like this example showing a 700% increase from April to May. Beyond direct expenses, inaction results in downtime, inefficient resource use, security breaches, and wasted engineering time.
The first one is related to lost revenue, customer churn, and reputational damage. Inefficient resource use is associated with cloud bills exploding because of wasted infrastructure. Security breaches can mean massive financial penalties, legal costs, and irreparable harm. And finally, wasted engineering time refers to hours spent troubleshooting reactive problems.
Similar to the sad story I told you. With this context, here is what I will cover today.
I will start by discussing cost management challenges, then explore how logs can be a valuable asset despite their complexities.
I will then dive into machine learning techniques, just an introduction to them at first, of course, specifically focusing on time series analysis, and I am going to explore practical use cases before opening the floor for your questions.
Implementing proactive cloud cost management involves continuously monitoring, analyzing, and optimizing spending on the cloud. It is about preventing costs instead of reacting to them.
The central premise here is that if logs are part of the problem, they
also hold the key to the solution.
In the context of cloud computing, which is a challenging environment, proactive cost management refers to a strategic approach to managing and optimizing cloud spending before it leads to unexpected overruns.
As it is mentioned here, it is a critical component to maximize business value from cloud investments while keeping costs under control.
Here are more benefits of proactive cost management in the cloud context.
An important benefit is anticipating and preventing issues instead of waiting for cost spikes or resource waste to occur.
Other benefits include continuous optimization, predictive analytics, increased elasticity, choosing the right pricing models, and visibility and monitoring.
The first one, continuous optimization, is an ongoing process of refining and improving cloud deployments to maximize resource utilization and achieve business outcomes at the lowest possible price. Predictive analytics is related to ensuring that cloud resources, for example virtual machines, storage, and databases, are properly sized for the actual workloads. Increased elasticity provides a scaling mechanism to match resource allocation with demand, so you only pay for what you use. Then there is choosing the right pricing models and matching storage classes, and, as I mentioned, visibility and monitoring.
That is the focus of this talk.
However, analyzing logs in the cloud presents unique challenges compared to traditional environments, like an on-premise environment, for example.
This is a consequence of the distributed, dynamic, and often ephemeral nature of cloud infrastructure, which can make log collection, analysis, and storage significantly more complex and costly.
And so we circle back to our core idea.
If the problem originates from logs, the solution too should be found within the logs themselves.
The challenge is how to effectively extract the solution.
With these challenges in mind, the big question remains: what is the solution?
How can we turn log data into actionable insights for proactive cost management?
A solution lies in anomaly detection.
That is a solution, not the only one, because there are other options in the state of the art, but in this case this powerful approach combines sophisticated machine learning techniques with statistical methodologies, for example, identifying significant departures from past data or from pre-established criteria to pinpoint anomalies.
This allows us to move beyond simple thresholds and detect subtle deviations.
So, what is anomaly detection?
It is about identifying patterns that significantly deviate from expected behavior, finding the abnormal within the normal. It is vital for logs because it shifts us from reactive firefighting to proactive prevention and gives us an early warning system to catch issues before they escalate.
By using algorithms that can recognize patterns or anomalies in big data sets, machine learning provides a more advanced method of anomaly detection.
The following are important machine learning methods that are frequently applied in SRE for anomaly detection.
I am going to start with unsupervised learning algorithms.
In this case, we don't require labeled data, and considering that we don't have labeled classes in cloud logs, this approach is well suited for anomaly detection applications like this case.
Techniques like clustering algorithms and autoencoders fall into this category.
The first ones, clustering algorithms, group data together based on similarity and recognize outliers as possible anomalies. And autoencoders are models based on neural networks that learn to reconstruct their input, so data that reconstructs poorly can be flagged as anomalous.
Supervised learning algorithms, another type of machine learning technique, on the other hand, require historical data with labeled anomalies to train the model.
This picture that I made, I think, illustrates this concept. Classification algorithms include support vector machines and random forests.
And semi-supervised learning algorithms combine elements of supervised and unsupervised learning: anomalies can be identified by a model which has been trained only on normal data.
Isolation forest is a tree-based technique that separates data into subsets in order to isolate the anomalies.
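To make one of these unsupervised options concrete, here is a minimal sketch of what it could look like in BigQuery ML, the service I will use later in the demo, using a k-means model and ML.DETECT_ANOMALIES. The project, dataset, table, and column names are hypothetical placeholders, not something from this talk's demo.

```sql
-- Hypothetical sketch: unsupervised anomaly detection with a k-means model.
-- Points that sit far from every cluster centroid are flagged as anomalies.
CREATE OR REPLACE MODEL `my_project.my_dataset.kmeans_model`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 4
) AS
SELECT feature_1, feature_2
FROM `my_project.my_dataset.metrics`;

-- contamination is the expected fraction of anomalous rows in the input.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.my_dataset.kmeans_model`,
  STRUCT(0.02 AS contamination),
  (SELECT feature_1, feature_2 FROM `my_project.my_dataset.metrics`)
);
```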
Since many SRE metrics have a temporal component, it is really challenging to use these techniques directly.
So time series analysis techniques are essential for identifying abnormalities over time. In seasonal decomposition, to find anomalies, the time series is broken down into seasonal, trend, and residual components.
I will review more details in the next slide.
Let me now talk about how Google manages time series, because Google Cloud offers some services that provide strategies for solving issues related to time series.
When dealing with time series, it is important to understand that most are not stationary, meaning the data's statistical properties change over time.
I know this could be confusing since we are not experts in machine learning techniques, but I think it is a good introduction to the topic, and the most important concept for solving the issue related to proactive cost management.
For instance, financial time series often exhibit random walk with drift behavior. Similarly, energy production is hugely influenced by factors like wind and solar supply, leading to dynamic patterns.
For solving that, the sad story that I told you about, I chose ARIMA. ARIMA is an acronym.
AR stands for autoregression: a model that uses the dependent relationship between an observation and some number of lagged observations.
I stands for integrated, the middle letter: it means the use of differencing of raw observations in order to make the time series stationary.
And MA, the last letters, stands for moving average: a model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
As I mentioned, these time series models are available in Google Cloud, particularly through BigQuery ML. BigQuery ML allows you to create and execute machine learning models using standard SQL queries.
I am going to show a demo in which you can see this service and this feature, including ARIMA, of course.
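Before the demo, here is a minimal sketch of what creating such a time series model looks like in BigQuery ML. The dataset, model, table, and column names are hypothetical placeholders, not the exact query from the demo; BigQuery ML's ARIMA_PLUS model type takes care of differencing, order selection, and seasonality for you.

```sql
-- Hypothetical sketch: an ARIMA-based time series model in BigQuery ML.
-- minute_ts is the timestamp column and log_count the value to model.
CREATE OR REPLACE MODEL `my_project.my_dataset.log_volume_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'minute_ts',
  time_series_data_col = 'log_count'
) AS
SELECT minute_ts, log_count
FROM `my_project.my_dataset.log_counts_per_minute`;
```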
Now let me explore some real world examples and use cases where anomaly detection with time series analysis can provide significant value.
By using algorithms that can recognize complex patterns, machine learning provides a more advanced method of anomaly identification, and these methods are frequently applied in SRE for anomaly detection.
Time series analysis and anomaly detection have broad applications across various industries, but I chose these use cases because they are very challenging and they are at the state of the art of the market.
In retail and e-commerce, they are used for sales and demand forecasting and churn rate prediction; challenges include forecasting new products and complex product hierarchies.
Financial services leverage them for asset management and product sales forecasting, despite challenges like noisy data and partially observable Markov decision processes.
In manufacturing, use cases include predictive maintenance and adaptive controls, often facing poor data quality and diverse sensor types.
And finally, healthcare, where anomaly detection is used for bed and emergency occupancy and drug demand forecasting, with data privacy and disparate sources being key challenges.
This is the architecture of my solution. The input is basically the logs, which are sent to BigQuery using a log router for the time series analysis.
Remember that the proper solution in Google Cloud to analyze the logs using time series is BigQuery ML.
Since third party solutions offer a better visualization of results, I export the results of the time series analysis to a CSV file, which is afterwards imported into Google Sheets with the aim of visualizing the results.
As you can see here.
Let me now go to the Google Cloud console and show you how you can detect an anomaly in Cloud Logging using BigQuery and time series analysis.
To do this, we are going to create a log sink in Cloud Logging using the Log Router option.
In this form, we enter the name and a description for the sink.
We select a destination, which in this case can be a log bucket in Cloud Logging, a dataset in BigQuery, a bucket in Cloud Storage, a topic in Pub/Sub, or a separate Google Cloud project.
For our exercise, we select the BigQuery dataset option, which enables us to create or select the dataset.
Here, we can create the dataset in the local project or in a new one, or select one that already exists, which is just what we are going to do in this case.
We add a filter to define the logs that we are going to send to BigQuery; in this case, we leave this option blank.
And as an optional step, we can create an exclusion to determine which log records will not be sent to the dataset.
With this information, we click on Create Sink and verify its creation.
To do this, we go to the BigQuery service and verify that the table exists.
Indeed, the table exists.
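If you prefer checking from SQL instead of the console, a query like the following, with a hypothetical project and dataset name, would list the tables created by the sink:

```sql
-- Hypothetical sketch: list the tables in the dataset that receives the sink's logs.
SELECT table_name
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`
ORDER BY table_name;
```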
Now it is time to create the model and detect anomalies in BigQuery.
Considering that the log table has several fields with null values that are not relevant for anomaly detection,
we create the model on a subset of the data using this query.
As evidenced by the query, we are going to use the timestamp of the logs, although we truncate it to minutes because that is the highest level of granularity that time series based anomaly detection supports. To have one sample per unit of time, we group the logs by date and, in this case, transform the values to integers to facilitate estimation.
With this grouping, we can obtain the associated counts, which we will use in the model configuration.
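As a rough sketch of the kind of aggregation just described, assuming the sink writes to a hypothetical table with a timestamp column, the per-minute counts could be produced like this and then fed into the CREATE MODEL statement shown earlier; the real demo query also leaves out the irrelevant fields mentioned above.

```sql
-- Hypothetical sketch: one sample per minute of log volume.
SELECT
  TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute_ts,
  COUNT(*) AS log_count
FROM `my_project.my_dataset.my_log_table`
GROUP BY minute_ts
ORDER BY minute_ts;
```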
With that done, let's use another query to detect the anomalies.
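For an ARIMA_PLUS model, that second query can be a call to ML.DETECT_ANOMALIES. This is a sketch with hypothetical names and a probability threshold you would tune:

```sql
-- Hypothetical sketch: flag timestamps whose log volume falls outside the
-- model's prediction interval (is_anomaly = TRUE in the output).
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.my_dataset.log_volume_model`,
  STRUCT(0.95 AS anomaly_prob_threshold)
);
```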
Now we are going to download the results to a local file and import them into Google Sheets to have more visualization options.
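Downloading from the console works fine for a demo; as an alternative, a scripted export of the same results to CSV could look like the following sketch, where the bucket and model names are hypothetical:

```sql
-- Hypothetical sketch: export the anomaly results to CSV in Cloud Storage,
-- ready to be imported into Google Sheets.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/log-anomalies-*.csv',
  format = 'CSV',
  header = TRUE,
  overwrite = TRUE
) AS
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.my_dataset.log_volume_model`,
  STRUCT(0.95 AS anomaly_prob_threshold)
);
```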
With this done, I generate a chart in order to visualize possible anomalies.
As you can see, the anomalies are here: there are a lot of logs generated on these days, which matches the behavior in Cloud Billing.
Having identified this pattern, we can predict the next increment in the size of the logs and, in consequence, in the billing.
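One way to turn that prediction into numbers is ML.FORECAST on the same model; again a sketch with hypothetical names, forecasting the next 60 minutes of log volume:

```sql
-- Hypothetical sketch: forecast future log volume, with a confidence interval,
-- so the expected impact on billing can be estimated ahead of time.
SELECT *
FROM ML.FORECAST(
  MODEL `my_project.my_dataset.log_volume_model`,
  STRUCT(60 AS horizon, 0.9 AS confidence_level)
);
```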