Conf42 DevSecOps 2022 - Online

Anonymize application logs in DevOps way

A look at the problem of handling PII, PHI, and other sensitive information captured in application logs, through a DevOps lens. Follow the journey to a log anonymization solution, comparing a software engineering approach with a DevOps approach that achieves the same result without changing a single line of application code. Then see how it works in practice using Data Loss Prevention algorithms.


  • Leonid: Log anonymization is one of the tasks in implementing data compliance. Data should comply with the standards and regulations that exist in the industry. Compliance controls for data in the logs can be applied to two processes: log ingestion (or log generation) and log access.
  • First I would like to review the basic implementations that are possible for log anonymization. The first is an engineering-oriented solution. The second delegates the actual work to a third-party or standalone implementation; it is a better fit for scaling and for running in multiple geographic locations.
  • A lot of applications run in the cloud, and when it comes to maintenance cost, one possible solution is to leverage the cloud. I would like to show you a reference architecture that can help you implement this solution in Google Cloud.


This transcript was autogenerated. To make changes, submit a PR.
Thank you for joining me at DevSecOps 2022. My name is Leonid. I am a developer relations engineer at Google. In my role I help developers improve the observability of their applications. Today I would like to talk to you about log anonymization and how it can be implemented.
Log anonymization is one of the tasks in implementing data compliance. As you know, data should comply with the standards and regulations that exist in the industry, such as GDPR, HIPAA, PCI DSS, and many others. Implementing this compliance can be very expensive from both the business and the engineering perspective. It requires changes to the application code and modifications to the environments and engineering processes. It can be especially complex for applications architected as multi-tier distributed service meshes or deployed across multiple geographic locations, which can be subject to different standards and regulations. Compliance addresses all data that the application generates, including data input by the user, data generated and maintained by the application, and data created by other sources such as CI/CD workflows. Data in the logs is also subject to compliance regulations and has to be compliant. This can be especially tricky because in many implementations of data compliance, logs are seen as a solution to various compliance requirements and not as a target of those regulations.
Implementing compliance, also referred to as compliance controls, can be roughly divided into three types. Preventive controls implement requirements that disallow certain operations or uses of the data. Identity and access management is one implementation of this type of control. Another type of control is detective controls.
Detective controls capture information about the activities that take place, which makes it possible to detect and alert on compliance violations, and they can be used in postmortem investigations related to compliance validation. Audit logs and other application logs are often used as part of the implementation of these controls. The last type, corrective controls, implements the reaction to detected compliance violations.
Let's look at how these controls are implemented for the data stored in the logs. Compliance controls for data in the logs can be applied to two processes. The first is log ingestion, or log generation, and the second is log access. Compliance controls for log access are straightforward. They usually focus on preventive and detective controls and can be implemented using identity and access management, audit logs, and the infrastructure that persists the log data. These controls for log access are usually universal across different regulations and standards. Compliance controls for log ingestion are different in that they usually involve detective and corrective controls, which can be specific to the industry in which the application is used as well as to the geographic location where the application is deployed.
The diagram that you see roughly depicts the process of log ingestion and how the detective and corrective controls can be applied. As you can see, two types of corrective controls can be implemented for the log ingestion flow. The first type removes the data that violates compliance from the logs, or drops the whole log entry. The second type obfuscates, partially or fully, the data that violates compliance.
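To make the two corrective controls concrete, here is a minimal sketch in Python. It assumes a single, hypothetical detective control (a regular expression for US Social Security numbers written as NNN-NN-NNNN) and plain-text log entries. The function names are illustrative and are not taken from the talk or from any DLP product.

```python
import re

# Hypothetical detective control: matches SSNs written as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detect(entry: str) -> bool:
    """Detective control: report whether the entry contains a violation."""
    return SSN_PATTERN.search(entry) is not None


def drop_violations(entries):
    """Corrective control, type 1: drop every entry that contains a violation."""
    return [entry for entry in entries if not detect(entry)]


def mask_violations(entries, full=True):
    """Corrective control, type 2: obfuscate the violating data fully or partially."""
    def mask(match):
        if full:
            return "[REDACTED]"
        # Partial obfuscation: keep only the last four digits.
        return "***-**-" + match.group(0)[-4:]

    return [SSN_PATTERN.sub(mask, entry) for entry in entries]
```

Dropping entries is the simplest option but loses the surrounding context, while masking keeps the entry available for troubleshooting.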
Implementing obfuscation logic requires special knowledge because sometimes it is not enough just to de-identify the data in the logs so that the original information cannot be restored. Sometimes it also requires preserving certain properties or qualities of this data for future use. As an example, think about the troubleshooting tasks that engineers have to perform, using the logs to reconstruct the original transaction logic that the application ran. Suppose, for example, that the application logs store the user identity and use the user's Social Security number as the primary user identity. And yes, I know it is wrong, and unfortunately many applications still do this. It is not enough just to obfuscate the part of the logs that stores the SSN, because that would make it impossible to follow the logic of the application flow. It is also important that the obfuscated value is still unique across all other logs and is the same for the same user. This kind of implementation can be challenging when developers implement corrective controls for their logs.
Now I would like to review the basic implementations that are possible for log anonymization. I will do it based on a simplified architecture that captures the main challenging elements that you might see when implementing log anonymization. The architecture includes a two-tier application, with the front-end application running on user devices and the back-end service running behind the firewall. On the diagram you can see blue arrows that capture the business communication and transactions of the application, and gray arrows that capture the log ingestion flows. Please note that this includes logs captured from the infrastructure, which in this diagram is represented by the load balancer, as well as the application logs.
The first solution is an engineering-oriented solution. This is what R&D teams usually do when tasked with log anonymization.
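As an illustration of an in-application implementation that keeps obfuscated values unique and consistent per user, the sketch below attaches a filter to Python's standard logging module and replaces each SSN with a keyed hash (HMAC-SHA256). Keyed hashing is one possible technique chosen for this example, not the method described in the talk. The key, the pattern, and the class name are assumptions.

```python
import hashlib
import hmac
import logging
import re

# Hypothetical detective control: SSNs written as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumption: in a real system the key would come from a secret manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a stable token: same input and key give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated to 64 bits for readability; collisions are unlikely but possible.
    return "id-" + digest[:16]


class SsnFilter(logging.Filter):
    """Corrective control that runs inside the application's logging path."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with its arguments applied
        record.msg = SSN_PATTERN.sub(lambda m: pseudonymize(m.group(0)), message)
        record.args = ()  # arguments are already merged into the message
        return True  # keep the (now obfuscated) record
```

A filter like this would be attached with `logger.addFilter(SsnFilter())`. Because the token is the same for the same user, an engineer can still follow one user's transactions across log entries without seeing the SSN.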
This approach implements the corrective and detective controls as close as possible to the place where the log entries are generated. Clearly, ownership of the solution lies with the R&D teams, which raises a potential problem for maintaining the application. This is the problem that DevOps was originally born from, so the follow-up work of maintaining it can be very expensive. As mentioned, implementing it requires specialized knowledge that R&D teams may not have. Additionally, this kind of implementation runs as part of the business application process. As such, it consumes the same resources that were originally allocated to the application itself and reduces application performance. It also means that if a problem is identified in the implementation of the compliance controls, or a modification is required due to changes in the regulations, a new version of the application has to be rolled out even if nothing has changed from the business perspective. Deploying such a solution in multiple geographic locations can be very challenging, because it requires either a complex and probably poorly maintainable configuration, or shipping the solution with more than one implementation of the detective and corrective controls so that it can run correctly in more than one geographic location. And last but not least, you can see that this implementation cannot handle logs ingested by the infrastructure itself.
So there is another solution, often seen in practice, that tries to mitigate some of these drawbacks. This solution delegates the actual work to a third-party or standalone implementation. Implementations of this kind are usually referred to as a data loss prevention (DLP) service. They can be found as commercial solutions or developed in house, with the same drawback of requiring specialized knowledge. It depends on each particular organization.
In this solution, R&D teams share the responsibility with DevOps, because rolling out and maintaining the DLP service falls on the DevOps team, while some of the implementation changes remain the R&D team's responsibility. For example, to work properly, a DLP solution should be aware of the source of the logs. If the application does not work with structured logs, or it does but does not provide enough metadata about the log source (such as what kind of service generated the log, where the service is located, what the specifics of the environment are, and so on), this data has to be added, and the R&D team will be responsible for implementing this change. Additionally, all the log ingestion logic on the application side has to be modified to redirect logs to the DLP service. In many cases, R&D teams will have to do this work. These changes can be partially mitigated by introducing a logging agent, but that is not always possible.
This solution has many advantages compared to the previous one. First, the maintenance is usually done by DevOps, so there is a team that specializes in maintenance, which can partially reduce the maintenance cost. Second, it can remove the need for specialized knowledge when a third-party solution is used, which also simplifies the compliance validation process and guarantees that the compliance requirements for log anonymization are implemented correctly. Additionally, it runs in its own environment. As a result, it does not consume resources that were originally intended to run the business logic. Being a standalone service, it can be rolled out separately, so any changes in configuration or implementation don't affect the application. Clearly, it is a better fit for scaling and for running in multiple geographic locations.
However, it still has a few drawbacks. It is still application focused, so infrastructure logs are still left unattended.
It still carries a relatively high maintenance cost, although the cost is shifted toward DevOps, and, as I mentioned, it may still require changes in the application code as a result of working with the DLP service.
Today, when a lot of applications run in the cloud, one possible way to address the maintenance cost is to leverage the cloud. We can modify the previous implementation by shifting all the work related to logs into the cloud. This provides us with a couple of advantages. First, it almost completely removes the R&D team's involvement, meaning no additional work is needed from the R&D team. The DevOps team gets full control over the solution from the beginning. The solution redirects the logs using a proxy that fronts the log management backend and sends all ingested logs to the DLP service first, and then from the DLP service to log management. Usually the rest of the application already runs in cloud hosting, and in the majority of cloud hosting environments the logs already get enriched with additional metadata and converted to some kind of structured logs, so no additional modifications to the application code are required. For hosts that don't do this, it is possible to use a log agent such as Fluentd or Fluent Bit, which can reformat the logs, especially if they are printed to standard output, or serve a local endpoint for log ingestion, and then forward the logs with additional information to the backend. The same can be done with OpenTelemetry, but I don't want to touch upon that today.
This solution lets you implement log anonymization without specialized knowledge about data identification or obfuscation. You can do it in a scalable and flexible way without changing a single line of code, while keeping all the work within the DevOps team, which already specializes in maintenance.
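The reformatting step that a log agent performs can be pictured with the following sketch. It is not Fluentd or Fluent Bit configuration, only a plain Python illustration that wraps a raw line printed to standard output into a structured JSON entry. The function name and the field names (service, location, environment) are assumptions based on the source metadata mentioned in the talk.

```python
import json
from datetime import datetime, timezone


def to_structured(line: str, service: str, location: str, environment: str) -> str:
    """Wrap a raw log line into a JSON entry that carries source metadata."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": line.rstrip("\n"),
        # Metadata a downstream DLP step could use to pick the right rules.
        "source": {
            "service": service,
            "location": location,
            "environment": environment,
        },
    }
    return json.dumps(entry)
```

With the source attached to every entry, the rules for a given region or industry can be selected downstream without touching the application.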
The additional maintenance cost can be significantly reduced if you host this solution in the cloud, which already provides part of the maintenance for you. It is especially easy when you use one of the major cloud providers, which have managed solutions for this type of service.
I would like to show you a reference architecture that can help you implement this solution in Google Cloud. This architecture is fully hosted in Google Cloud. It includes four managed services: Cloud Logging; Pub/Sub, which is a managed asynchronous messaging service; Dataflow, which runs Apache Beam as a managed ETL pipeline; and Cloud DLP, the service that provides implementations of the detective and corrective controls out of the box. The logging service in Google Cloud supports not only storing logs but also log routing, so you do not have to implement the proxy yourself. You only have to configure the proper routing logic in your environment. Additionally, in Google Cloud, structured logs with all the necessary metadata about the log source are provided out of the box, and Google Cloud provides a logging agent that can be installed in your environment and will enrich your logs with the relevant information.
What is most interesting about this flow is the implementation of the compliance controls, so let me walk you through the ETL pipeline implemented in the Dataflow service, which is the managed version of Apache Beam. For those of you who are not familiar with it, Apache Beam is an extract, transform, and load (ETL for short) pipeline framework that lets you define multiple transformations for the data ingested into the pipeline and, at the end, exports the data to a predefined destination. It can work on a batch of data, which is a finite set of input data, or on a stream.
In our case, all log entries are streamed into this pipeline by the logging service, which routes all logs from the application and from the infrastructure into the pipeline. The first step transforms each Pub/Sub message and retrieves the relevant data of the log entry. The second step aggregates this data into large batches, and then these batches are sent to DLP. The purpose of aggregating multiple entries into batches is to save on both cost and performance, because all cloud providers, including Google Cloud, impose quotas that throttle API performance at large volumes. Additionally, most cloud providers charge per API call, so sending multiple entries in a single call can save you significantly on your monthly cloud bill. It also allows you to scale very well with log-intensive applications. The DLP API lets you define configuration for both detective and corrective controls, and for each entry that a detective control identifies as matching the condition, the corrective control logic is applied. Eventually the processed batch is returned from DLP, formatted back into log entries, and ingested into the log storage bucket.
You can find the implementation of this flow in the GitHub repo. Please find the link in the slides. If you want a more in-depth understanding of this reference architecture and its implementation, you can read about it in the blog post I published on Medium. If you have additional questions about this topic or about any other observability topic, I will be glad to chat with you on Discord. Thank you for being with me.
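The pipeline steps described in the walkthrough (parse the message, batch, de-identify, reformat) can be sketched in plain Python. This is not the Apache Beam code from the repository. The de-identification function is a local stand-in for a single DLP API call, and the message layout (a JSON object with a `textPayload` field) is an assumption.

```python
import json
import re
from typing import Callable, Iterable, Iterator, List


def extract_payload(message: bytes) -> dict:
    """Step 1: decode a message body into a log entry dictionary."""
    return json.loads(message.decode("utf-8"))


def batch(entries: Iterable[dict], max_size: int) -> Iterator[List[dict]]:
    """Step 2: group entries so that one API call covers many of them."""
    current: List[dict] = []
    for entry in entries:
        current.append(entry)
        if len(current) >= max_size:
            yield current
            current = []
    if current:
        yield current  # flush the final, possibly smaller batch


def fake_deidentify(texts: List[str]) -> List[str]:
    """Stand-in for one DLP call: detect SSN-like values and replace them."""
    return [re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text) for text in texts]


def run_pipeline(
    messages: Iterable[bytes],
    deidentify: Callable[[List[str]], List[str]] = fake_deidentify,
    max_size: int = 100,
):
    """Run all steps and return the cleaned entries and the number of calls made."""
    calls = 0
    output: List[dict] = []
    entries = (extract_payload(message) for message in messages)
    for group in batch(entries, max_size):
        cleaned = deidentify([entry["textPayload"] for entry in group])
        calls += 1
        # Step 4: put the cleaned text back into the log entry format.
        for entry, text in zip(group, cleaned):
            output.append({**entry, "textPayload": text})
    return output, calls
```

With five messages and `max_size=2`, the sketch makes three calls instead of five, which is the saving that batching is meant to provide. A streaming pipeline would also need a time limit for flushing partial batches, which is omitted here.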

Leonid Yankulin

Developer Relations Engineer @ Google