Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

From Fragmented Logs to Unified Insights: A Personal Journey in Scaling Observability

Abstract

At Dropbox, fragmented logs hindered clarity. I transformed observability with open-source tools, unifying container data. Join me for real-world lessons and personal insights on elevating DevOps culture and operational excellence.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome everyone. I'm Alok, Engineering Manager for the Storage Platform at Dropbox. Today we will discuss how we scaled observability with Grafana Loki. Our agenda covers the context of our observability challenges, the technical and operational requirements we faced, our evaluation of different logging solutions, and finally why we chose Loki. Before we dive in, let me give you some context about my background. I'm currently an engineering manager for the storage platform team at Dropbox. Prior to this I worked at Big Switch Networks, VMware and Cisco, primarily focusing on storage systems and scalable infrastructure. My team at Dropbox is responsible for ensuring our storage systems can handle massive scale while maintaining security and performance. To appreciate why this problem was so challenging, let's first look at the scale at which we operate. Dropbox serves over 700 million users globally, with more than 18 million paying users. We are handling over a trillion pieces of content, and we are processing billions of new file uploads every single day. Let's start by defining what we mean by unstructured logs so we can better understand our problem statement. Unstructured logs are essentially raw data outputs that don't adhere to any defined schema, unlike structured logs, records, or traces, which are neatly formatted. At Dropbox, these logs come from various sources, including our first-party code and third-party debug files, such as those found in /var/log/dropbox. Because these logs lack a predefined structure, they offer a high degree of flexibility and detail, making them incredibly useful for real-time troubleshooting. However, that same flexibility means that analyzing them can be challenging and time consuming, especially when you need to quickly pinpoint an issue during an incident. Let's go over our problem statement. As I said, unstructured logs are stored under /var/log. Prior to our adoption of Loki, we didn't have an actual logging solution at Dropbox. Developers used to SSH into individual boxes. We had a host rotation policy of seven days: after seven days the host would be rotated out, so all the logs on that host would be deleted. Around that same time, we were also migrating from standalone hosts to containers, and containers are even more ephemeral; they can just vanish at any moment. Also, SSHing into a container to get to /var/log/dropbox was not that easy either; it was a little cumbersome. So that was the entire problem statement. From there we gathered a high-level requirement. We needed to provide a user-friendly and secure interface that makes it easy for production service owners to look at their unstructured logs. As I said, our current process involved manually SSHing into hosts to retrieve logs, which was inefficient and error prone. We wanted to automate and simplify this process. We wanted to make sure the system is capable of ingesting the entire stream of production logs without any modification to our application code. Finally, the architecture should be designed to seamlessly integrate logs from our acquisitions and other corporate assets. From there we gathered some more requirements on the reliability and security side. We wanted to provide at least one week of log storage, and since there was a seven-day host rotation policy, this requirement came directly from that.
From there, we analyzed our metrics and figured out we were ingesting approximately 150 terabytes of logs per day. Today it is way more than that, but that's where we started. We wanted to make sure the P99 ingestion latency is less than 30 seconds, and for queries it should be less than 10 seconds. We wanted our availability goal to be 99%. We wanted mTLS enabled right from day one, so that every connection is authenticated and secured. We wanted to implement strict access control based on service ownership; by segmenting access we could provide better auditing and ensure only authorized teams can view and manage their logs. To further secure our logs, we wanted to encrypt all the stored data using some sort of key management practice. And finally, we wanted to provide a mechanism for detecting, filtering and redacting PII if it ever leaked. We also wanted to call out some non-goals. We did not want to mandate any changes to our existing log formats. We wanted our system designed in such a way that it can work with any unstructured logs we already have, so that teams don't have to migrate to it. Our goal wasn't to replace any of the structured logging, tracing, or metrics systems. We only wanted this system to be used as a complementary tool focused on troubleshooting, rather than serving as any other observability solution. The solution was also not going to be a replacement for any analytics system. And lastly, we didn't want to impose any new logging practices on teams; we wanted to avoid mandating anything about how logs are produced or what the format would be, so that existing team owners don't feel any burden when this new system is added. Next, let's take a look at the evaluation metrics. These metrics provided us with a framework for comparing different approaches. Cost was definitely the highest priority. We wanted to make sure that the total cost of ownership, including both opex and capex along with any potential contract risk, was accounted for. Performance was another critical metric: we wanted to make sure the logging solution could handle our ingestion rate and match the query latency we had in mind. We definitely wanted a rich UX and a rich query engine, and since our engineers were very familiar with Grafana, we were highly interested in a solution which could integrate with Grafana. Along with that, we wanted ease of integration with our current observability tools; we wanted a solution that would seamlessly work across our existing infrastructure without major changes. And finally, security: we wanted robust data protection so that we could minimize the risk of sensitive data exposure. Now let's consider the do-nothing scenario. The pros, of course, were no additional investment. The cons were that we would have to continue with manual, non-scalable, inefficient troubleshooting. Ultimately the status quo was not meeting modern observability needs, so that's why we didn't proceed with it. Now let's examine our evaluation of all the other logging service options. In this slide, we'll compare several potential solutions side by side. The first one was externally managed SaaS.
Here we would use a logging solution offered by a third-party service. Of course, the pros were that it reduces in-house management overhead. But the cons were the very high cost of operating it, and also, since we would be sending the entirety of our logs to a third-party service, there was a potential risk of exposing PII to them. So that solution was rejected, both because of cost and security concerns. The second option was managed cloud logging, where a managed search and logging solution would be hosted on a cloud framework. The pros were that these kinds of technologies have matured enough and they are scalable. But again, even for this option, at our scale the operational cost was going to be very high, and we looked at a few solutions and felt the UX was still not where we wanted it to be. The third option was self-hosted enterprise, where we would host some enterprise-grade log management system on our premises. The pros were that you get all the rich feature set and very robust vendor support. But the cons were that the licensing was still going to be very expensive, and it would still need to run on our infrastructure, so that cost is there plus the overhead of maintaining it. That's why we rejected this as well. The last option was building our own logging solution: just build a custom-developed system that fits our needs. Obviously this was the most exciting option, but this is not our core competency, and we realized that the vendors in the observability field are so mature that it is better to go with an established solution than to reinvent the wheel. Now let's focus on Grafana Loki. Being open source, Loki offers a low total cost of ownership compared to other proprietary solutions, which made it an attractive option from a budget perspective. It's optimized for Dropbox-scale log ingestion and querying, which means it can handle the massive volume of logs we generate without compromising on speed. Loki provides a native, unified observability interface through Grafana, and that was one of the major factors. Loki's architecture is distributed, so it is designed to grow alongside your needs, ensuring that as data volumes increase, our observability capability would also remain robust. So overall, Grafana Loki ticked all the boxes by combining performance, cost efficiency, ease of use and scalability, making it an ideal choice for our logging needs. What is Loki? Loki is an open-source project, meaning it's free to use, inspect and contribute to, and this has led to very strong community engagement and rapid iteration, helping it evolve into a production-grade solution. It is built to scale out across multiple nodes, ensuring that it can handle increasing volumes of log data. Loki is designed with high availability in mind, so it remains resilient even if individual components fail. The system supports multi-tenancy: you can securely isolate log data for different teams or services, which was critical for our diverse production environment. Loki draws inspiration from Prometheus and leverages a similar approach to metrics collection and querying, which makes it familiar to many of our teams, because our internal metrics system is also quite similar to Prometheus. At its core, Loki is a log aggregation system that efficiently collects, indexes and stores log data while minimizing resource overhead.
Now that we have talked about why Loki exists and how it differs from traditional logging solutions, let's dive into how it works at a high level. The first thing you need to get logs into Loki is an agent. For that, we have Promtail. One important distinction to make here is that unlike Prometheus, Loki does not use a pull model. Instead of logs being scraped, they are pushed to Loki. This is one area where Loki is different from Prometheus. So why not follow the same pull-based model as Prometheus? The answer is simple: logs are fundamentally different from metrics. Logs are generated at very high volume and need to be streamed efficiently rather than pulled periodically. Before we dive into the technical details, let's start with the fundamental reason why Loki exists. Many existing logging solutions are difficult to operate and expensive to scale. Loki was built to solve these problems by being very simple to operate and very cost efficient. One of the key design decisions that sets Loki apart is that it does not perform full-text indexing. Instead, it only indexes metadata about each line. So why does this matter? Because in most logging solutions the biggest cost driver is maintaining a massive index for full-text search. Indexing every single word from every log entry means you need gigabytes, sometimes terabytes, of index storage, leading to very high memory usage and expensive infrastructure. Loki takes a different approach: it focuses on efficient metadata indexing instead of full-text indexing, making it much more lightweight and scalable. Now let's break down how logs are structured in Loki. Every log entry in Loki consists of three main components: timestamp, labels, and content. The timestamp provides precise time information for when the log was generated, down to the nanosecond. Labels are key-value pairs that serve as the metadata index, helping Loki locate logs efficiently. Content is the actual log message itself, which is not indexed but stored efficiently for retrieval. By keeping the index small and focusing only on labels, Loki allows for faster ingestion and querying with minimal infrastructure overhead. This architecture is what makes Loki highly scalable compared to traditional logging solutions. One of the core concepts in Loki is the log stream. A log stream is simply a collection of log entries that share the exact same label set. This means that logs generated from the same application on the same server with the same metadata are grouped together. For example, let's look at these log entries: logs from the dummy service running on example node one form one stream, and logs from the dummy service running on example node two form a different stream. By structuring logs this way, we ensure that the logs remain organized, queryable, and scalable without the overhead of full-text indexing. Now let's talk about how Loki stores logs efficiently. Logs in Loki are grouped into chunks based on their label set. Here is how this approach works: each log stream is stored in separate chunks, and these chunks are sorted in timestamp order. A chunk continues collecting logs until it reaches a target size or a timeout occurs. Once a chunk is full, it is compressed and then flushed to an object store like AWS S3, Google Cloud Storage, Azure Blob Storage, or even a local file system. This chunk-based approach ensures that logs are stored in the most cost-effective way possible.
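To make the entry and stream structure concrete, here is a minimal sketch of pushing a single log line through Loki's HTTP push API (/loki/api/v1/push). It assumes a Loki instance reachable at localhost:3100, and the service/node label names simply mirror the dummy-service example above; in a real deployment the agent (Promtail) does this for you.

```python
import json
import time
import urllib.request

# Assumed local Loki endpoint; adjust for your deployment.
LOKI_PUSH_URL = "http://localhost:3100/loki/api/v1/push"

def push_log(labels: dict, line: str) -> None:
    """Push one log line to Loki: labels (indexed), timestamp in ns, content (not indexed)."""
    payload = {
        "streams": [
            {
                # The label set identifies the stream; keep cardinality low.
                "stream": labels,
                # Each value is a [timestamp_in_nanoseconds_as_string, log_line] pair.
                "values": [[str(time.time_ns()), line]],
            }
        ]
    }
    req = urllib.request.Request(
        LOKI_PUSH_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Lines sharing this exact label set belong to the same stream,
# which is what gets chunked and flushed to object storage.
push_log({"service": "dummy_service", "node": "example_node_1"},
         "GET /health 200 4ms")
```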
Instead of relying on expensive databases, Loki leverages scalable object storage to reduce cost and improve retrieval efficiency. So how do we retrieve logs efficiently in Loki? When a query is executed, Loki first looks at the metadata index to determine which chunks contain the relevant logs. For example, if we query service equals dummy service and node ID equals example node one, for logs between timestamps T5 and T7, Loki only fetches the chunks that fall within that time range. It does not need to scan through all the logs, just the ones that match the query criteria. This makes queries fast, efficient, and scalable even when dealing with massive amounts of log data. Now, what if we need to query logs across multiple streams? Let's say we want logs from both example node one and example node two. In that case, Loki will use the index to determine which chunks match our query, retrieve only the necessary chunks instead of scanning all the logs, then combine the results and display them in tools like Grafana. This query efficiency ensures that even large-scale logging systems remain performant. In addition to Grafana, Loki provides command-line utilities like LogCLI, which allow developers to interact with logs from the terminal, and it can be integrated with other scripts and workflows. By leveraging stream-based querying and avoiding full-text indexing, Loki provides a scalable and cost-effective way to store logs. When working with Loki, label selection is critical. High-cardinality labels like trace ID, user ID, or dynamic path values can explode the number of unique log streams, hurting performance and increasing cost. Instead, we should favor low-cardinality labels, like cluster, app or filename, to keep things efficient. For example, even combining just three labels like log level, status and path, each with a few possible values, can easily create 36 unique streams (say, 3 levels times 4 statuses times 3 paths), so it adds up pretty fast. Now let's talk about how we deploy Loki at Dropbox. But first, let's take a look at a simplified diagram of Loki's architecture. At the top, we have the main components of the write and read paths in yellow. On the bottom we have an object storage, S3, for storing logs in green. And on the right side we have some shared components like the memcache clusters in blue and cluster-wide services like the compactor and etcd in orange. Starting with the write path, logs are written to the distributors, which hash the log stream labels, then look up the hash ring in etcd and route the logs to the ingesters that own their hash range. The ingesters buffer the log chunks in memory, then eventually flush the logs to object storage in the background. The compactor merges the log index files in object storage for faster queries. For the read path, the query frontend receives a query and splits it up into subqueries, which are processed by different queriers. The queriers fetch recent logs from the cache in the ingesters, or fetch older logs from object storage. For a sense of the scale of our Loki deployment, here are some top-level metrics. We are processing on the order of 10 gigabytes of logs per second, and we store these logs for 30 days, which takes up about 10 petabytes in object storage. We have around 1000 tenants, and our users are making less than one query per second. So as you can see, the system is heavily skewed towards writes over reads.
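To make the read path concrete, here is a minimal sketch of the kind of range query described above, issued against Loki's query_range HTTP endpoint. It again assumes a Loki instance at localhost:3100 and reuses the hypothetical dummy-service / example-node labels; in practice you would run the same selector through Grafana or LogCLI.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed local Loki endpoint; adjust for your deployment.
LOKI_QUERY_URL = "http://localhost:3100/loki/api/v1/query_range"

def query_range(selector: str, start_ns: int, end_ns: int, limit: int = 100):
    """Fetch log streams matching a label selector within [start, end)."""
    params = urllib.parse.urlencode({
        "query": selector,       # label matchers; log content is filtered, not indexed
        "start": str(start_ns),  # nanosecond Unix timestamps
        "end": str(end_ns),
        "limit": str(limit),
    })
    with urllib.request.urlopen(f"{LOKI_QUERY_URL}?{params}") as resp:
        return json.load(resp)["data"]["result"]

# Mirrors the talk's example: one service, one node, a narrow time window.
now = time.time_ns()
streams = query_range('{service="dummy_service", node="example_node_1"}',
                      start_ns=now - 3600 * 10**9,  # last hour
                      end_ns=now)
for stream in streams:
    print(stream["stream"], len(stream["values"]), "lines")
```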
With Dropbox being a storage company, one thing we did is use our internal object storage as a drop-in replacement for S3. This has led to cost savings, especially on data transfer costs. With these lower costs, we were able to increase our log retention from one week to four weeks. One other big difference between S3 and our internal storage is the performance characteristics. We noticed that S3 would gradually scale out to handle large reads, and because most queries are scanning many logs, this would lead to timeouts. Our internal systems, on the other hand, have reserved capacity, and we also have emergency capacity, so we don't have the same issue. Another thing: because the index files are so large, we still have to write them to S3. Loki supports isolating the access and storage of logs by tenant. At Dropbox, a tenant is a service, which is a group of projects. Because the logs are stored by tenant, there are performance implications. For one service, we had to use the project as the tenant instead of the service, because that service just produces too many logs. Before Loki, engineers had to use production SSH access just to view logs which were present on the service hosts. With Loki, we have redefined the model so that each service acts as its own tenant. This means we are aligning with our existing permission model: what worked for accessing a service is now used to secure log access. We have extended this model by adding a group permission, enabling teams to grant access to their logs to other teams when needed. And naturally, some teams even have permissions that allow them to access all logs. To pull it all together, we developed a custom query auth proxy. This proxy intercepts each query to ensure that the user has proper permissions, which avoids the complexity and cost that would come from using Grafana's native RBAC, an approach that would also require a separate data source per tenant and would have needed us to upgrade to Grafana Enterprise. One thing that comes up a lot is teams trying to share logs with other teams. For example, if Team A wants to share their logs with Team B, Team B must request permission. But because the permission is owned by the logging team, only we can approve it. While there is an incident or a SEV going on, this delay can be costly. So our solution for that was break glass, which allows any user with a justified reason to temporarily gain access to any service's logs. We keep an audit trail and use safeguards on the amount of logs a user can access. We run Loki in two data centers in separate geographical regions, so that in case one data center goes down, we still have availability. At Dropbox we call this multihoming. Both data centers use the same object storage, and to route logs and queries to the correct Loki cluster, we integrate our DNS servers with load balancers. Now I'm going to go over some of the scaling challenges we faced. By default, Loki ingesters store a write-ahead log on disk. That way, if the ingester crashes before flushing the logs, it can recover them. At Dropbox, we disable the write-ahead log to prioritize availability over durability. Loki supports per-tenant rate limits to avoid the noisy-neighbor problem. What we did is set conservative default rate limits, and teams are notified by an alert when their service goes over those limits. Loki supports overriding rate limits per tenant in a file.
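Loki's real per-tenant limits live in its runtime configuration; the sketch below is only an illustration of the pattern described here, conservative defaults plus per-tenant overrides that are periodically hot-reloaded, with a hypothetical file path and limit names borrowed from Loki's limits config.

```python
import json
import threading
import time

# Conservative defaults applied to every tenant unless overridden.
DEFAULT_LIMITS = {"ingestion_rate_mb": 4, "ingestion_burst_size_mb": 6}
OVERRIDES_PATH = "/etc/loki/tenant-overrides.json"  # hypothetical path

_overrides: dict = {}

def limits_for(tenant: str) -> dict:
    """Return the tenant's override if present, else the conservative defaults."""
    return _overrides.get(tenant, DEFAULT_LIMITS)

def _reload_loop(interval_s: int = 60) -> None:
    """Re-read the overrides file periodically so changes apply without a restart."""
    global _overrides
    while True:
        try:
            with open(OVERRIDES_PATH) as f:
                _overrides = json.load(f)  # {"tenant_name": {"ingestion_rate_mb": ...}}
        except FileNotFoundError:
            _overrides = {}
        time.sleep(interval_s)

threading.Thread(target=_reload_loop, daemon=True).start()
print(limits_for("dummy_service"))  # falls back to defaults until an override appears
```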
We distribute those overrides to Loki in minutes using our distributed key-value store, and we hot-reload these override files in the Loki components that use them. To scale reads and writes, Loki uses a hash ring to shard logs. Each ingester owns a range in the hash ring and periodically registers its range and health status in the ring, which is stored in a distributed key-value store. The distributor, which receives logs from applications, uses that ring to route logs to the correct ingesters, plus replicate the logs to other ingesters. Here is an example of that: when a log is ingested, the labels for the log plus the tenant, which is usually the service at Dropbox, are hashed, and then the distributor takes the hash and does a lookup in the ring. Originally, we used etcd as the backing store for the ring. etcd is a distributed, consistent key-value store that's often used for coordination and configuration in distributed systems. It is notable for being the default for Kubernetes, and it is also now widely used at Dropbox. One issue we had with etcd was write contention. Each ingester would send a heartbeat every minute and update the ring. Also, when an ingester joins or leaves the ring, it updates the ring. etcd stores the whole ring as a binary blob in a single key, and every time the ring is updated, the whole ring is replaced with a read plus a compare-and-swap operation. With a replication factor of three and 67 ingesters each, we had 201 total ingesters, so as you can imagine, there was a lot of write activity going on. Some of the ways the write contention would manifest itself: deployments would take hours because ingesters were pushed one at a time, and these pushes would often fail because an availability alert would fire. Also, because etcd was a single point of failure, we had some outages when etcd went down. Luckily, Grafana made memberlist the default distributed key-value store for Loki and other projects, so we replaced etcd with it. Memberlist is a peer-to-peer gossip protocol, so we no longer had to maintain another component. With memberlist, the ring updates are propagated as smaller deltas instead of the whole ring. However, one downside is that it is eventually consistent, unlike etcd, which offers strong consistency. This means we still have to deploy the ingesters slowly. But the pros are that we no longer have to deal with write contention or a single point of failure, and honestly, we haven't had any issues in the last year using memberlist. We were also able to double the number of ingesters after migrating to memberlist. Log indexes determine the query plan, including how many log chunks need to be fetched and scanned, so they have a good deal of impact on performance. Grafana changed the index format from BoltDB to TSDB, which is based on the Prometheus TSDB, so it just works better with log labels. In summary, our goal was to scale and improve observability at Dropbox. We were facing challenges with manually SSHing into hosts and digging through unstructured logs; it was a slow, error-prone, and hard-to-scale process. We evaluated several solutions and ultimately chose Grafana Loki for its performance, cost efficiency, and native multi-tenant support. As a result, we have significantly improved log retention, lowered cost, and enabled fast, secure access to logs across teams. So that's all for my presentation. Thank you.
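For readers who want to picture the ring lookup the distributors perform, here is a simplified, hypothetical sketch rather than Loki's actual implementation: it hashes the tenant plus the sorted label set and walks a token ring to pick replica ingesters (real Loki also accounts for zones and ingester health).

```python
import bisect
import hashlib

# Hypothetical ring: each ingester registers the token it owns in a shared KV store.
RING = sorted(
    (int(hashlib.sha256(name.encode()).hexdigest()[:8], 16), name)
    for name in ("ingester-1", "ingester-2", "ingester-3", "ingester-4")
)
TOKENS = [token for token, _ in RING]
REPLICATION_FACTOR = 3  # matching the replication factor of three mentioned above

def owners(tenant: str, labels: dict) -> list:
    """Hash tenant + label set, then walk the ring clockwise to pick the replicas."""
    key = tenant + "".join(f"{k}={v}" for k, v in sorted(labels.items()))
    token = int(hashlib.sha256(key.encode()).hexdigest()[:8], 16)
    start = bisect.bisect(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(REPLICATION_FACTOR)]

# The distributor would send this stream to all three owning ingesters.
print(owners("dummy_service", {"node": "example_node_1", "level": "info"}))
```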
...

Alok Ranjan

Engineering Manager @ Dropbox
