Conf42 Kube Native 2023 - Online

Building a business-critical data platform to process over £34bn in card transactions

Abstract

This talk is about building a highly scalable, self-serve data platform to support the data mesh revolution by leveraging CNCF technologies. I will share the challenges and mistakes we faced while building the platform, which integrates with every single data point (event streams, APIs, SFTPs, files, etc.).

Summary

  • Dojo is the fastest growing fintech in Europe by net revenue. We power face-to-face payments for around 15% of card transactions in the UK every day. We are highly data driven and value insights in our decision making. This talk focuses on file processing.
  • There have been three generations of data processing; the current generation is mostly centralized data platforms. With a data mesh approach, domains become independent and can create new data sources by themselves. This approach brings a set of benefits.
  • Adhering to the PCI standard is one of our prime concerns, given that we own the end-to-end payment stack. Most of our workloads and processing power run on GKE. We have very strict SLAs, and scaling the platform to meet those SLAs is a critical part of our architecture.
  • A bit more about autoscaling, which is a crucial part of our platform. At a high level there are two types of scaling: horizontal autoscaling adds or removes pods, while vertical autoscaling assigns more resources to pods that are already running. There are multiple ways to trigger autoscaling, for example based on resource usage or external metrics.
  • We use the Kubernetes cluster autoscaler to scale the size of the cluster itself. It works by automatically adding or removing nodes. Combined with the HPA, file processing can scale out very quickly. This is where it becomes more like a self-serve data platform.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thanks for joining. We'll be talking about building a data platform to process over 50 billion in card transactions. I'm Sandeep, engineering lead for data platforms at Dojo. If you don't know about us, we are the fastest growing fintech in Europe by net revenue. We power face-to-face payments for around 15% of card transactions in the UK every day. We are mainly focused on the experience economy of bars, pubs, restaurants, even your local corner shop or farmers market. We also have a consumer facing app that allows you to join the queue for high street restaurants within around a two-mile radius. We have an amazing line of payment products, including Tap to Pay and Pocket Pay, we also provide small business funding, and we are building many more products to change how the payment industry works. We are going international soon; hopefully you will see us in every corner of Europe first, and then the world.

What makes us different, and also makes all of this possible, is the fact that we are highly data driven and value insights in our decision making. When you go to a shop and tap your card, the transaction happens within the blink of an eye: you get the goods, and the merchant selling you the goods or services (for us, the merchant is the customer) gets the money.

When you tap a card, the first thing that happens is that the card machine takes the details from the card and securely sends them to the authorization gateway. At Dojo we use point-to-point encryption between the card machine and our authorization gateway, with hardware security modules in multiple facilities and direct peering to multiple cloud providers. It is highly secure and highly scalable; we can't let our authorization gateway go down, because that is the main point where we take transactions. The authorization gateway then contacts the card network, such as Visa, Mastercard or Amex, which forwards the request to the issuing bank for approval. This will be someone like your bank in the UK, Barclaycard, Monzo, Tesco, who will then freeze the funds on the customer's account. It's very important to note that the actual money has not been exchanged yet; it is just a promise of a payment, and it can still be reversed. The card network then sends the approval back to the authorization gateway, which displays the result to the customer as approved or declined. That is how your card transaction actually works, and this whole process generates a lot of data.

From a payments point of view, because we are a payment company, we have a couple of regulatory challenges, since we own the end-to-end payments experience. We operate under an e-money license from the FCA, the Financial Conduct Authority. With this come strict regulatory requirements, most notably the fact that we have to ensure at all times that customer funds are safeguarded in case the business fails or becomes insolvent. Then you have PCI DSS Level 1 compliance, which is around the safeguarding and storage of full card numbers. Owning the whole payment stack, with full card numbers, means we also have to comply with PCI DSS Level 1, and we are independently audited every year to ensure that we actually follow all the rules and guidelines provided by PCI. Then we have other complexities around schemas.
So we process more than 1,000 data contracts, or schemas. And this talk, I should mention, will be around file processing, so I'm going to talk about how we built the processing of these files coming from many different sources, all these schemes, Visa, Mastercard, and all the payment boxes. We have 1,000 schemas, as I said, and then you have multiple file sizes and multiple formats. It's not as if there is a standard file size, say a range from five MB to ten MB; it doesn't work like that. We have files which are two KB, ten KB, 50 KB, and it goes up to gigabytes, and we also get zipped files. Then you have scalability: demand can be unpredictable. Sometimes we have 500,000 files arriving at once, because the SLAs between multiple sources overlap and all of the files come at the same time, and sometimes all of those files are business critical, so we have to process all of them at the same time.

For example, this is a snapshot of an internal reconciliation process performed by our payments analytical engineering team, just to ensure that a merchant's net settlement tallies with the card transactions that were actually authorized on the tills in their shops. As you can see, that many CSV, XLS and XML files are required just to get that one job done. And the incoming files don't just come as XLS or XML; they come in every format possible. We have JSON files, we even have Avro files, now we have Parquet files as well, we have fixed-width files, and we have proprietary formats created by the scheme companies like Visa and Mastercard, for which you have to write very heavy custom parsers to actually make sense of those files and load them into your transactional data warehouse.

The goal we came up with, or the goal we thought would be good for us, was to take all of these file formats and process them into a single final format, which can then be used by downstream processes to stream the data into the warehouse, into Kafka, into Snowflake, or anywhere else we want. We chose Avro, mainly because of its strengths around schema evolution. There are multiple formats you could choose now, but we chose Avro at that point.
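To make the chunked-Avro idea concrete, here is a minimal Python sketch of converting one source file into roughly equal-sized Avro chunks, assuming a CSV input and the fastavro library. The schema, field names and chunk size are illustrative assumptions, not Dojo's actual configuration.

import csv
from fastavro import writer, parse_schema

# Illustrative schema; a real pipeline would load this from the schema registry.
SCHEMA = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount_pence", "type": "long"},
        {"name": "currency", "type": "string"},
    ],
})

CHUNK_RECORDS = 100_000  # target records per output chunk (illustrative)

def flush(records, out_prefix, chunk_no):
    path = f"{out_prefix}-{chunk_no:05d}.avro"
    with open(path, "wb") as out:
        writer(out, SCHEMA, records)  # fastavro writes the Avro container format
    return path

def csv_to_chunked_avro(src_path, out_prefix):
    """Parse one CSV source file and emit roughly equal-sized Avro chunks."""
    outputs, buffer, chunk_no = [], [], 0
    with open(src_path, newline="") as src:
        for row in csv.DictReader(src):
            buffer.append({
                "transaction_id": row["transaction_id"],
                "amount_pence": int(row["amount_pence"]),
                "currency": row["currency"],
            })
            if len(buffer) >= CHUNK_RECORDS:
                outputs.append(flush(buffer, out_prefix, chunk_no))
                buffer, chunk_no = [], chunk_no + 1
    if buffer:
        outputs.append(flush(buffer, out_prefix, chunk_no))
    return outputs

Chunking by record count is a simplification; chunking by bytes written would track the "roughly the same size" goal more closely, but the shape of the pipeline is the same.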
And it's not just payments: there are a lot of other business areas in the company which are important. They produce a lot of event data and also generate a lot of file data, which can come from APIs, from external tools, or be generated by their own microservices. So to support the processing of files, with all the schema evolution, the scale and all these challenges in mind, it was very important for us to design a data platform which is scalable and self-serve.

Before deep diving into modern data processing and how we have done it, I would like to take us back into the history of data processing. There have been three generations of data processing. The first generation was the generation of enterprise data warehouse solutions: big giants like Oracle, IBM and Microsoft were building those big, big warehouses which only a few people knew how to use. Then came the era of the big data ecosystem, the era of Hadoop, when Pig, Hive and Spark came into the picture. We built those big, monolithic, centralized, beefy Hadoop big data platforms, which again were only utilized, monitored and operated by a few people: a huge bottleneck, and a huge skills shortage to actually make use of them.

Then we get to the current generation, mostly centralized data platforms: a mix of batch and real-time processing based on tools like Kafka, and Apache Beam, which also gives you that mix of real-time and batch processing, then Pub/Sub, then Redpanda, then all sorts of cloud managed services, with AWS, GCP and other providers like Confluent and Aiven offering their own managed services on top. Well, this centralized model can work for organizations that have a simpler domain with a smaller number of diverse consumption cases. It fails for us, because we have rich domains, a large number of sources and a large number of consumers, and most companies going through this kind of growth have the same problem: the centralized data platform really doesn't work. And there are other challenges with it. The centralized data platform is mainly owned by a centralized data team, which is focused on building and maintaining data processing pipelines and on building data contracts hand in hand with stakeholders, but at the end of the day there is no clear ownership. Then there is supporting issues for all the domains without actually having the domain knowledge. Then you have silos, and a specialized data team which will only keep adding features if they get some time away from support issues or everyday change requests. Another issue is data quality, accountability and the democratization of data: it's very difficult to enforce or put measures in place for data quality if there is no clear ownership of the data. The fact that the data team manages data access makes it a bottleneck when it comes to access requests. Then scalability: adding new data sources, increased data volumes, data contract changes and so on can be delayed by a huge data team backlog, because they are the ones actually managing and maintaining the data pipelines. A byproduct of all this is a not-so-great relationship between the data team and the other teams, and a lot of blame games when things go wrong. And over the last decade or so we have successfully applied domain-driven design on the engineering and operational side, but as a whole engineering community we completely forgot to apply it to the data side.

Now, what should we do in this kind of scenario? How should we go on? What is the right way of building that file processing platform which we built at Dojo, or any kind of data platform which you or anybody else might want to build? Imagine this: what if the centralized data team were to focus only on creating generic data infrastructure, building self-serve data platforms by abstracting away all the technical complexities and enabling other teams to easily process their own data. Apart from that, they would also provide a global governance model and policies for better access management, better naming conventions, better CI and better security policies, and at the same time data ownership would move from the centralized team to the domains.
I have said "domains" many times in this talk so far; by domains I mean teams that are working on a certain business area, for example payments, marketing, customer, and so on. Now, this whole approach brings a set of benefits. Finally, the data is owned by the people who understand it best in the company. You reach better scalability by distributing data processing across the domains. Domains are now independent: they are able to add new data sources, create new data pipelines and fix issues by themselves. Domains can put better data quality and reliability measures in place than the centralized data team, because they understand the data better. Domains have the flexibility of using only what they need from the generic data infrastructure and the self-serve data platform features, which reduces the complexity blast radius if things go bad, and things always go bad. Having said that, the centralized data team should make sure that there is good monitoring and alerting in place to monitor all the features of the data platform across the domains. This approach also enforces well-documented data contracts and data APIs, which allow data to flow seamlessly between one system and another, whether internal or external. And having domains own their data infrastructure brings more visibility on resource allocation and utilization, which can lead to cost efficiency.

Now, this whole concept, my friends, is data mesh. I know I'm talking a lot of theory right now, but I will come back to how we built this at Dojo; this part is very important. Data mesh starts with domain ownership, which is one of its important pillars: it's your data, you own it, you manage it, and if there is a problem with it, you fix it. Then data as a product: data is not a byproduct. Treat your data as a product, whether it's a file, an event, a warehouse or a beefy table; treat all of it as a product. Then federated computational governance: each domain can set its own policies or rules over data, for example the payments domain sets encryption standards on payment data, or the marketing domain sets retention on customer data to 28 days to adhere to GDPR. Then the final and my favorite bit is the self-serve data platform. This is the key to all of this: if you successfully build a self-serve data platform, you've got 80% of things right. It will enable teams to own and manage their data and data processing.

But the big question is how you're going to build it. It looks easy now, but when you have so many choices, and so many people pitching you at conferences that their data platform is the best, the one-stop solution, it can be quite overwhelming. Based on your use case you will do certain PoCs, and you would probably look at the open source solutions available before committing thousands and thousands of pounds or dollars to some managed provider claiming they can solve all your problems. When we were building this file processing platform, we were quite sure that the platform-as-a-service, or self-serve platform, offering was the only way to move forward.
Now, at a high level, four main components of our platform provide this end-to-end solution for file processing we just talked about: we have 20-plus different types of files coming in, around 500k files every day to process, they vary in size, and they arrive in certain hours, so we have to scale the system accordingly.

The first component is source connectors, which ingest data from external providers in a consistent way; we built those connectors to bring the data in. Then you have the PCI processing platform, which makes sure that clear credit card information is stored, masked and processed securely, and then sent on to the non-PCI platform. The non-PCI platform takes all the non-PCI data, plus the data coming from all the other domains, and performs schema validation and data validation, generating output in the final format we agreed on before, Avro, and specifically chunked Avro, because the source file sizes vary so much. We make sure the final Avro files are chunked to roughly the same size, so we don't end up with one Avro file that is two gigabytes and another that is a few kilobytes. Then target connectors: streaming the generated Avro files from this lakehouse, or distributed data lake, into the data warehouse, into any other streaming system, into an external warehouse such as Snowflake, or loading them into the MLOps platform, and so on.

Now let's deep dive into these components, starting with source connectors. We have a number of different data sources: storage buckets, external APIs, webhooks, Gmail attachments (trust me, we still have processes where we have to get the data from email attachments) and external SFTP servers. We mainly use Rclone to move most of the data that arrives as files. It's an amazing open source utility that comes in very handy when you want to move files between two storage systems: you can move from S3 to GCS, or from SFTP to any kind of storage bucket, and it works like a charm, although you do have to spend some time on the configuration. Then you have webhooks. Where the data was not in files, for example in the case of webhooks, we batch those events into files to keep things consistent. We did have some strict SLAs, but not SLAs requiring this information to be processed in real time, so in those cases where we don't have to process the information straight away, we use webhook connectors which batch the events and write them into files. We are moving that completely to the event streaming side now; that was a historic decision we took. Then we have serverless functions which allow data to be received automatically by email, including the email body. I just talked about email attachments: we have one provider that does not attach a file but provides a system configuration message in the body of the email, so we had to write a parser for that as well; there are always use cases like that. We then use these connectors to land the data in the PCI or non-PCI platform, depending on the sensitivity of the data.
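As an illustration of the source-connector idea described above, here is a minimal Python sketch of a connector job that shells out to Rclone to pull new files from a partner SFTP into a landing bucket. The remote names, paths and flags are hypothetical examples, not Dojo's actual configuration.

import subprocess

# Hypothetical Rclone remotes, defined in rclone.conf:
#   [partner-sftp]  an SFTP remote pointing at the provider
#   [landing]       a Google Cloud Storage remote for the raw landing bucket
SOURCE = "partner-sftp:outgoing"
DESTINATION = "landing:raw-files/partner"

def pull_new_files():
    """Copy anything new from the partner SFTP into the landing bucket.

    rclone copy skips files that already exist at the destination, so the job
    is safe to run repeatedly, for example as a Kubernetes CronJob.
    """
    subprocess.run(
        [
            "rclone", "copy", SOURCE, DESTINATION,
            "--config", "/etc/rclone/rclone.conf",  # mounted configuration
            "--max-age", "24h",                      # only consider recent files
        ],
        check=True,
    )

if __name__ == "__main__":
    pull_new_files()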
If we know that files are encrypted and are supposed to be processed by the PCI platform to get the credit card information masked, they go directly to the PCI platform; if not, they bypass that process and go directly to the non-PCI platform. Before jumping into the PCI platform, just a few lines on what PCI compliance is, so all the listeners understand how much complexity there is, and how much you have to consider, when building a PCI-grade data platform environment. Adhering to the PCI standard is one of our prime concerns, given that we own the end-to-end payment stack. These standards are a set of security requirements established by the PCI SSC to ensure that credit card information is processed securely within any organization. Some of the key points of this compliance, at a high level, are: all credit card data has to be transmitted through secure, encrypted channels; clear card numbers cannot be stored anywhere unless they are anonymized or encrypted; and the platform that deals with PCI data has to be audited for security concerns every year.

This is our PCI management process within the PCI platform. We have those files coming in; we don't know the size of the files, they are zipped, they are encrypted. So we decrypt the source file in memory, using confidential computing nodes. Then we encrypt it with our own key and archive the file for future purposes and for safeguarding. Then, if it's a zip file, we unzip it on disk. If the file contains PANs, we open the file, mask the PANs, and send the masked version of the file to the non-PCI platform; because it's now masked, it no longer falls into the PCI category.

And this is how the overall processing looks. The files start arriving in the PCI storage bucket, and this is all running on GKE, in Kubernetes; Rclone runs as a CronJob in Kubernetes. For object storage we are using GCP's GCS, and for queues we are using Pub/Sub at the moment. So just so you know, we are completely GCP based, but most of our workloads and most of our processing power runs on GKE, so it's a good mixture of being cloud native as well as cloud agnostic at the same time. The connectors run, pulling files from SFTP, storage buckets or S3, and the files arrive in object storage. The moment a file arrives in object storage, a file-created event is generated and goes into a Pub/Sub queue. Then, based on the number of messages in the queue, the HPA scales the workload; we will talk about scaling in detail later in the presentation. We typically have around 300 pods running at peak hours across 40 nodes, to process around 200 to 300k files within roughly 30 minutes. The pods fetch the file information from these events in the Pub/Sub topic, process the files, and mask the content if required. We have very strict SLAs, and scaling the platform to meet those SLAs is a critical part of our architecture. We are using horizontal pod autoscaling and cluster autoscaling together to scale the platform.
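Here is a minimal Python sketch of what one of these file-processing pods might look like: a Pub/Sub subscriber that reacts to object-created notifications, masks PANs in the file, and writes the masked copy to a non-PCI bucket. The project, subscription and bucket names, and the masking rule, are illustrative assumptions rather than Dojo's actual implementation (which also handles decryption, archiving and unzipping).

import re
from google.cloud import pubsub_v1, storage

PROJECT = "example-project"          # illustrative names only
SUBSCRIPTION = "pci-file-created-sub"
NON_PCI_BUCKET = "non-pci-landing"

# Illustrative masking rule: keep the first six and last four digits of a 16-digit PAN.
PAN_RE = re.compile(r"\b(\d{6})(\d{6})(\d{4})\b")

def mask_pans(text):
    return PAN_RE.sub(lambda m: m.group(1) + "*" * 6 + m.group(3), text)

def handle(message):
    # GCS Pub/Sub notifications carry the bucket and object name as attributes.
    bucket_name = message.attributes["bucketId"]
    object_name = message.attributes["objectId"]

    client = storage.Client()
    raw = client.bucket(bucket_name).blob(object_name).download_as_bytes()

    masked = mask_pans(raw.decode("utf-8"))
    client.bucket(NON_PCI_BUCKET).blob(object_name).upload_from_string(masked)

    message.ack()  # acknowledge only after the masked copy is safely written

def main():
    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
    # Each pod keeps pulling file events until the autoscaler removes it.
    subscriber.subscribe(path, callback=handle).result()

if __name__ == "__main__":
    main()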
A bit more about autoscaling now; autoscaling is a crucial part of our platform. We have to process these files as soon as they arrive so we can perform the settlement and billing operations, pay our merchants and reconcile the money, which is crucial to our business. On the other hand, we also want to make sure that our infrastructure is cost effective and that we are not running workloads when they are not needed, to stay aligned with a FinOps culture. There were a few challenges we faced when we implemented horizontal pod autoscaling. Setting up resources on the pods was a bit difficult: deciding what the starting request should be and what the limit should be. I know there have been a lot of talks in the Kubernetes community arguing that we don't need to set limits, but in our case we had to, because we are scaling so many pods and pods can consume a lot of memory and resources. So we tried a lot of different options when we were trialling this in production; we started hating PagerDuty, but eventually it worked out for us and it's now scaling quite nicely.

There are two types of scaling available, at a high level. Horizontal autoscaling updates a workload with the aim of scaling it to match demand: the response to increased load is to deploy more pods, and if the load decreases, the deployment is scaled back down by removing the extra pods. Then you have vertical autoscaling, which means assigning more resources, for example memory or CPU, to the pods that are already running in the deployment. There are multiple ways to trigger this autoscaling. You can trigger it based on resource usage, for example when a pod's memory or CPU exceeds a threshold you can add more pods. Then there are metrics within Kubernetes: any metric reported by a Kubernetes object in the cluster, such as an input/output rate. And there are metrics coming from external sources like Pub/Sub: for example, you can create an external metric based on the size of the queue, and configure the horizontal pod autoscaler to automatically increase the number of pods when the queue size reaches a given threshold and reduce the number of pods when the queue size shrinks. This is exactly what we did, and what we used to scale the platform we have been talking about.

So far we have talked about the HPA and adding more pods to scale the deployment, but we also need to remember that the Kubernetes cluster itself needs to increase its capacity. How do we fit all these scaled pods into the cluster? For that we need to add more nodes, and we use the Kubernetes cluster autoscaler. The cluster autoscaler is a tool that automatically adjusts the size of a Kubernetes cluster, scaling it up or down by adding or removing nodes when one of the following conditions is true: there are pods in a pending state in the cluster due to insufficient resources, or there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes. One very important thing to remember: if your pod requested too few resources when it first started, and at some point it wants more CPU or memory but the node it is running on is experiencing resource shortages, the cluster autoscaler won't do anything for you.
We actually did not read the documentation properly and believed it worked the other way around, and we lost a good couple of days trying to figure out why the processing was so slow. You have to go back and revisit the resources for your pods all the time. Now we know that the HPA and the cluster autoscaler work together, hand in hand, to scale the platform, and our file processing can become very scalable all of a sudden, because we adopted Kubernetes and all these flavors of scalability with it.

And this is how it actually looks right now. Let me take you back to the processing. You have object storage, all the files are landing, the queue is filling up with those file-creation events, and it is getting bigger and bigger. How do we scale it? We said unacked messages (that's Pub/Sub terminology for events that have not yet been processed) divided by four equals the number of pods, or workers, required. But we still set a maximum limit as well: if I have 500k files still to be processed, I cannot run 125k pods, so we say a maximum of 300 or 400 pods can be running at any point in time on this particular cluster. And that actually works, because the file processing is very fast; it processes one file within a second or a few seconds.
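As a rough sketch of the scaling rule just described (unacked messages divided by four, capped at a few hundred pods), this is effectively the calculation the HPA performs when it targets an average backlog per pod; the numbers are the ones mentioned in the talk and are used purely for illustration.

import math

TARGET_BACKLOG_PER_POD = 4   # unacked messages each worker pod should be handling
MAX_REPLICAS = 300           # cap, so a 500k-file burst cannot demand 125k pods
MIN_REPLICAS = 1

def desired_replicas(unacked_messages):
    """Mirror of the HPA external-metric maths: ceil(backlog / target), clamped."""
    if unacked_messages <= 0:
        return MIN_REPLICAS
    wanted = math.ceil(unacked_messages / TARGET_BACKLOG_PER_POD)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))

assert desired_replicas(40) == 10          # small backlog, small deployment
assert desired_replicas(500_000) == 300    # burst hits the configured ceiling

In the real platform this calculation is done by the horizontal pod autoscaler itself, driven by an external Pub/Sub metric, rather than by application code.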
And this is how the non-PCI platform looks. This is the platform where all the validation, all the schema contracts, all the monitoring events and everything else is encapsulated in this group of microservices, or collection of tools and infrastructure, as you could call it. The connectors work exactly the same as on the PCI side, just with some extra flavors of schema history, file stores and things like that. All the connectors send files into the source bucket, which is the non-PCI bucket. File-creation events are generated into the topic. Then the HPA comes into the picture and scales the deployment of these translate pods (we call them translate because they do the translation based on the config provided). Then the cluster autoscaler kicks in to scale the cluster, and the translate pods parse and translate every single source file into chunked Avro files. The translate process works entirely from what's inside the schema of the file, and this is where it becomes more like a self-serve data platform. Every single pipeline belongs to a single domain: payments has its own pipeline here, marketing has its own pipeline here. And at the end of the day there is one schema history, with a UI on top of it, from which users can log in and change the schema. They can say: this file now contains an extra column, and I want to process that extra column, but I don't want to go to the data team, raise a request and ask them to do it; I would like to do it myself. And this is how it happens: they go to this web interface, and there is the schema version, schema name, source and a lot of other information. We are actually trying to make it a bit nicer now; we are taking all the other configuration out (we now have Argo CD workloads and things like that), so we are taking all of that out and leaving just the schema bit there.

But the gist here is that the users, the domain owners, those domain teams, can actually manage their own pipelines by themselves. We provide a data catalog using DataHub, so they can discover everything they need to; that is also an ongoing project at the moment, but it's very interesting, and maybe someday we'll talk more about it. Then we provide monitoring on top of it. The schema registry also has a component where you can say: I want to know when my file arrives and when my file lands in BigQuery, my data warehouse, and if that doesn't happen by 09:00 in the morning I want to be alerted, because my processes are going to fail and I need to notify downstream stakeholders, or for any other reason. So the schema registry plays a very key role, and at the end of the day it is a data contract between the source, the processing and the target.

When we process files from the PCI to the non-PCI environment with the help of the schema registry, the files move through many different stages in the whole processing journey, and the state management of the file becomes very important, because you want the file processing to be fault tolerant, you want to handle errors when they happen, you want to support live monitoring, and you want to prevent duplicate processing, because queues can deliver duplicate events, and you want to make sure that once a file is processed, it stays processed.

Now let's see how the flow works. An object file is created and an event goes into the file event subscription topic, so the file is effectively in a to-do state. The HPA is listening to that topic, it scales everything up, and the pods start consuming from the topic. When they consume an event, they first go to the file store (you can call it a data store, or a state store) to check whether the file is already being processed by some other pod. If that's the case, they skip it. If not, they set the status to in-progress and start processing. If everything succeeds, they record that translate started and translate completed without any error, and everybody is happy. If there is an error, they record the error, and for some transient errors we can resend the file to be processed again, so they set the status of the file back to to-do so that some other pod can pick it up.
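To illustrate the state management just described, here is a minimal Python sketch of the to-do / in-progress / completed / error lifecycle with a duplicate-processing guard. The in-memory dictionary stands in for the real file store, and the exact status names are assumptions based on the talk.

from enum import Enum

class FileState(str, Enum):
    TODO = "to_do"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    ERROR = "error"

class TransientError(Exception):
    """Errors worth retrying, for example a flaky network call."""

# Stand-in for the real file/state store, keyed by file identifier.
state_store = {}

def claim(file_id):
    """Return True if this pod may process the file, False if it is already owned.

    A real store would make this check-and-set atomic (for example a conditional
    write) so two pods racing on a duplicate event cannot both claim the file.
    """
    current = state_store.get(file_id, FileState.TODO)
    if current in (FileState.IN_PROGRESS, FileState.COMPLETED):
        return False
    state_store[file_id] = FileState.IN_PROGRESS
    return True

def process_file(file_id, translate):
    if not claim(file_id):
        return  # duplicate event, the file is already being handled elsewhere
    try:
        translate(file_id)
        state_store[file_id] = FileState.COMPLETED
    except TransientError:
        state_store[file_id] = FileState.TODO   # put it back so another pod retries
    except Exception:
        state_store[file_id] = FileState.ERROR  # surfaced to the monitoring service
        raise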
All these state events are also very useful for monitoring. They are sent into a metric store, and then we have a file monitoring service which talks to the schema registry and, based on the config provided in those schemas, or feeds as we call them, aggregates this information and starts generating metrics, reports and alerts, and sends more aggregated information to Grafana. For infrastructure observability, as I said before, almost everything we run is on Kubernetes, so we run Prometheus integrations at the moment, getting all the important metrics, such as resource usage, pod health, node health, Pub/Sub queue metrics and so on, into Grafana. Then we have live dashboards which reflect the status of the platform, and users can go and see their own feeds, their own pipelines, and see if anything is down; they also get alerted. We get alerted too, because the centralized team owns the infrastructure, so they act as second-line support to fix issues if they happen there.

I'm not going to go into the analytics platform in much detail, but the analytics platform is basically there to analyze all the raw data coming into BigQuery, run DBT models, create the derived tables and produce insights from them. It is also based on Kubernetes, and this whole deployment of the analytics platform is owned by every single domain: payments has its own, marketing has its own, customer has its own, everybody has their own analytics platform.

This is my showcase of end-to-end file monitoring, and this is what the Slack message looks like: you can see stage one, stage two, stage three, stage four, stage five, and here stage four is failing. The user can just click on it to see which file is failing, and from there we have playbooks and ways to identify the errors and fix them as well.

This is how the overall ecosystem of the data platform looks. Today I talked about the PCI platform and the file processing platform and only touched on the analytics platform, but we have done a lot of work on the streaming side, and on discovery, observability, quality, governance and developer experience, and we are still doing a lot more. We still have a long way to go to completely embrace this self-serve data platform, or data mesh. And if anybody is interested, please go to the Dojo careers page; it's not just the data team, we are hiring across all the functions. Thank you for your time, and if you want to DM me directly or connect with me on LinkedIn, this is my profile. Thank you so much, have a good day.

Sandeep Mehta

Engineering Manager, Data Platforms @ Dojo

Sandeep Mehta's LinkedIn account


