Conf42 Cloud Native 2024 - Online

Enriching the data vs Filtering in Spark

Abstract

Spark is known for its in-memory computation. But in-memory computation, particularly inner joins on large datasets, makes it hard to trace back how data got filtered out at each stage. This talk highlights lessons learned from production and how we pivoted towards one approach over the other.

Summary

  • Capital One is the first US bank to exit out of legacy on-premise data centers and go all in on cloud. Capital One also contributes actively to open source projects. The company stays true to its mission of changing banking for good.
  • First we are going to see the loyalty use cases at Capital One. Then we will go through a data-based walkthrough of the same, which enables us to compare the two different approaches. Finally we will conclude and leave some time for Q&A.
  • Capital One is a pioneer in credit cards, and this platform is the one through which pretty much any of our credit card products gets processed. It is a platform built on top of Apache Spark which processes millions of transactions. And this is our proving ground for comparing these two design patterns in Apache Spark.
  • The filtering-the-data approach is done using Spark's inner join. What happens is the non-matching records naturally get filtered out at each stage, and only the ones which are eligible, which are matching, really make it to the end. When we put this in production, there were some challenges we really faced.
  • Hope you all got some details about me. I'm a Capital One engineering manager. I regularly give presentations on big data, NoSQL, and Apache Spark. Hope you all have a great conference.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Thank you for taking your time for this presentation on enriching the data versus filtering the data in Apache Spark. I'm Gokul Prabagaren, engineering manager in the card loyalty organization at Capital One. So before we really dive into today's topic of interest, enriching versus filtering, I would like to give some details about Capital One. Capital One is the first US bank to exit out of legacy on-premise data centers and go all in on cloud. You can imagine what kind of a tech transformation a public company would have gone through to really achieve such a feat. So that is why we are now a tech company which happens to be in the banking business. We have invested heavily in our tech capabilities and we pretty much operate as a tech company. And all these things are possible mainly because we are a founder-led company to date, staying true to its mission of changing banking for good. How do we really stay focused on that mission? There are many ways. One of them is that we give back to our community, and within that too we do multiple things. This being a tech conference, and we being a tech-focused organization, I would like to start off with this: we not only operate as an open-source-first company, we also contribute actively to open source projects, including from our implementations within our enterprise organization as a financial services company. There are many things we do in a regulated industry which can benefit others, so we give back a lot of those as open source projects. Many open source projects came from our organization, and I have called out a few, like Critical Stack, Rubicon, Data Profiler, DataComPy, Cloud Custodian. They play in various spaces like DevOps, Kubernetes, data tooling, data cleansing and more. So that's open source for you guys. The next one is Coders.
Coders is a program we run in middle schools across the United States, where we work with middle school students and provide them opportunities to envision a tech career in their future and to get really hands-on experience while they are still in middle school. The next one is Quoda. This is the program which paves the way for non-tech folks to get into the tech stream and take up tech as their career, and we empower them with the opportunities to be really successful in tech. So now to our agenda. First we are going to see the loyalty use cases at Capital One. When we get into the details, you will understand why this is our starting point, because this is where this whole design pattern evolved, and that design pattern first starts with the filtering-the-data approach in Apache Spark. There were some challenges we faced with that approach, and that is what led us to a different approach which we tried out and are now sharing with you, which is the enriching-the-data approach, along with how those challenges were fixed using it. First we will start off with the theory on what this really means, using our use case. Then we will go through a data-based walkthrough of the same, which enables us to compare the two approaches and see why we pivoted towards one over the other; we believe that is something which will help a lot of you. Finally we will conclude and leave some time for Q&A. So first we are going to start off with the loyalty use case at Capital One. If you have used any Capital One product, Capital One being a pioneer in credit cards, this platform is the one through which pretty much any of our credit card products, be it Venture X or Venture, focusing on travel, or SavorOne, on dining and entertainment, gets its transactions processed. And this platform is one of the core credit card rewards platforms built on Spark.
I have abstracted the details for brevity, but as you see, we receive all those credit card transactions and we process them by reading them, then applying a lot of account-specific eligibility rules, and then transaction-specific eligibility rules. Finally we compute the earn and persist it in the database. You can imagine an account-specific business rule, in the simplest terms, being something like: hey, is this account really a valid account to get rewards at this point in time? And the same goes for a transaction too. There are various things we do, but the simplest one is: is this really an eligible transaction to earn rewards? Those are the business rules we apply, and then we compute the earn, and that's what gets persisted in the data store. Now you can imagine why we are starting off with this: this is the platform built on top of Apache Spark which processes millions of transactions, and it is our proving ground for comparing these two design patterns in Apache Spark. So let's start. First we are going to start off with the filtering-the-data approach. Keep the previous picture in your mind; we are going to stretch that same example a little bit deeper, looking at the same use case through the filtering-the-data lens. What we are really doing with the filtering-the-data approach is: we receive some transactions, then we apply some account-specific business rules, then transaction-specific business rules. How are we doing this? The key thing to remember when it comes to the filtering-the-data approach is that it is done using Spark's inner join. Spark being a big data framework specializing in in-memory computation, the filtering-the-data approach does the same thing: an inner join in memory. So with that context, let's see this.
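The stages described above (read transactions → account rules → transaction rules → compute earn → persist) can be sketched as simple function composition. This is a minimal plain-Python illustration, not the actual platform code; every function name, the sample data, and the 2% earn rate are made up for the sketch.

```python
# Hypothetical pipeline skeleton for the stages described above.
# All names, data, and the 2% earn rate are illustrative only.
def read_transactions():
    return [{"acct": "A1", "amount": 25.0, "category": "purchase"}]

def apply_account_rules(txns):
    # e.g. "is this account really valid to get rewards right now?"
    return [t for t in txns if t["acct"].startswith("A")]

def apply_transaction_rules(txns):
    # e.g. "is this transaction eligible to earn rewards?"
    return [t for t in txns if t["category"] == "purchase"]

def compute_earn(txns):
    # illustrative flat 2% earn rate
    return [{**t, "earn": round(t["amount"] * 0.02, 2)} for t in txns]

def persist(txns):
    # stand-in for writing the computed earn to the data store
    return len(txns)

persisted = persist(compute_earn(
    apply_transaction_rules(apply_account_rules(read_transactions()))))
print(persisted)  # 1
```

The interesting design question, which the rest of the talk addresses, is what each rule stage does with the records that fail its check.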
Even though we operate with millions of transactions, just for establishing the fact, let's say ten transactions is what we are dealing with in this example. We read those transactions and then apply some account-specific business rules using Spark's inner join. What happens naturally is that the non-matching ones get filtered out. The transactions which are not matching in this example, five, are gone, so five go to the next stage. We do the same with the transaction-specific business rules, where we apply Spark's inner join in memory, which filters out three more transactions which are not matching. So finally only two make it to computation. Let's assume those two are really eligible transactions and we compute the earn for them. The key takeaway from this is that we are using Spark's inner join in memory at each stage of the data pipeline. The non-matching records naturally get filtered out at each stage, and only the ones which are eligible, which are matching, really make it to the end. Seeing this, you may think: hey, this sounds like exactly what Spark does, right? Let's stretch the same example and establish the fact using data. Now we have three transactions and we do a Spark inner join to find the matching ones. What is matching? We naturally get only two transactions when we join on an account status of good, which indirectly means they're good accounts. We do the same with transaction eligibility: assume the category is what we are checking, because we don't want to deal with payments. In that case, we filter out the payment and only process the accounts which have made a purchase.
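The filtering flow just described can be sketched in plain Python (not actual Spark code); the inner-join semantics are the point. The three sample transactions, account IDs, and the 2% earn rate are made up for illustration, mirroring the talk's example of two rows surviving the account join and one surviving the category check.

```python
# Illustrative sketch (plain Python, not Spark) of the filtering approach:
# each stage behaves like an inner join, so non-matching rows silently vanish.
transactions = [
    {"acct": "A1", "amount": 25.0, "category": "purchase"},
    {"acct": "A2", "amount": 10.0, "category": "purchase"},  # account not good
    {"acct": "A3", "amount": 40.0, "category": "payment"},   # not a purchase
]
account_status = {"A1": "good", "A3": "good"}  # A2 has no matching row

# Stage 1: "inner join" on account status == good -> A2 disappears
stage1 = [t for t in transactions if account_status.get(t["acct"]) == "good"]

# Stage 2: "inner join" on transaction eligibility -> the payment disappears
stage2 = [t for t in stage1 if t["category"] == "purchase"]

# Only survivors reach earn computation (illustrative 2% rate); the filtered
# rows leave no trace, which is the debugging problem discussed below.
earn = [{**t, "earn": round(t["amount"] * 0.02, 2)} for t in stage2]
print(earn)
```

After stage 1 only A1 and A3 remain, and after stage 2 only A1 does; nothing in the output records why A2 and A3 were dropped.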
That leaves us with one transaction in this example. It goes to our earn computation, and then we give them the rewards they are entitled to. If you look at this flow, it's the same Spark inner join in memory which leaves one transaction in this case. So what's the problem with this approach? When we put this in production, there were some challenges we really faced. First, after having the application deployed to production, it was really hard to debug. The main question is: hey, what has happened to those transactions which got filtered in memory? It's something which happens at that moment, and everything is dealt with in memory for the speed of the Apache Spark framework. But if we have to backtrace all those transactions, all of which happened in memory, that's where the real challenge starts. Being in a regulated industry, we really need to know, above all, what has happened to the transactions for which we did not provide earn. And in production, for anyone for that matter, you should be in a position to know what has happened to each one of your records if something gets filtered out in memory. In some use cases that probably fits; in our use case it did not. People who are familiar with Apache Spark may argue: hey, you can do counts at each stage. Yes, that's possible. But there are two issues with that approach too. First, count is a costly operation in Apache Spark. Smaller data pipelines can probably live with doing counts, but if your processing is big enough, you may not be able to afford counts at each stage when you are dealing with millions and billions of rows. The second problem with the count approach is that you only get to know how many records got filtered out or made it to the next stage. You never get to know why, unless you already know the context. So these are the challenges we faced with the filtering-the-data approach.
How did we really overcome this problem? That's where we pivoted. After doing some research, we pivoted towards our next design pattern, which is the enriching-the-data approach. Take the same example of dealing with ten transactions; the key difference is that instead of Spark's inner join, we use Spark's left outer join. The main point is that we are not filtering out any data at any stage. In the previous case, you started off with ten and filtered information out, which means your number of rows decreases. In this case, we are not filtering out; we are enriching the data with all the contextual information from the joined data set, so that you keep all the information. The rows are not changing; instead, your columns are growing. So, the same example: ten transactions. We apply the account-specific rules; nothing changes. But we have picked up some columns which we need to make a determination later. So ten rows again make it to the transaction stage, where we apply the transaction rules. The same ten transactions stay; we picked up some transaction-specific columns. Now we apply all the business logic. Then we arrive at the same result: good accounts which made a purchase, two transactions. That's what is pushed to the next stage, and they get whatever they are entitled to. The key difference is that the rows are not changing, because we are enriching the columns using a left outer join. Let's drive home the fact using a data example for enriching as well, similar to what we did with filtering; we will look at the same example. Here, with the same accounts and transactions, with the left outer join the number of rows is not changing. Instead, we enriched the data set with the status, which is what we need to determine whether that was a good account.
We do the same with the transaction: a left outer join. Our columns increased, as we picked up the category into our data set. Now we have the same three transactions with a few more columns. We use this information and apply some business logic, and we can even enhance with a few more columns to make the computation easier. So what we do is ask: hey, is this row really eligible for the next stage? Just mark it as true. Then you can go and pick whichever rows have that column set to true for your next stage of processing. In this case, that's the only one which has a good account and has also made a purchase. So we get the same result as in our previous example. What's the advantage of enriching over filtering? If you compare it against the problems we faced with filtering: with enriching, the data is not shrinking. Instead, we enrich the original data set so that it captures the state information, which makes it easy for us to debug and analyze later. We also have the data columns and flags captured at each stage, which gives us more granular details to debug as well as backtrace: hey, what has happened to the other two transactions in our example? Why did they not make it to the next stage? This naturally eliminates any need for counts at any stage, because we have all the required information. Cool. You probably all got some comparison between the two Apache Spark based design patterns, and you may now be able to make an informed decision on which one fits your use case. We really made the switch to the enriching-the-data approach in our production after we went live with the first version using filtering in the initial days.
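The enriching flow can be sketched the same way, again in plain Python rather than Spark, using the same made-up three transactions as the filtering sketch: every row survives each stage, gaining a status column and an eligibility flag instead of being dropped.

```python
# Illustrative sketch (plain Python, not Spark) of the enriching approach:
# each stage behaves like a left outer join, so every row survives and
# gains columns; sample data and the 2% earn rate are made up.
transactions = [
    {"acct": "A1", "amount": 25.0, "category": "purchase"},
    {"acct": "A2", "amount": 10.0, "category": "purchase"},
    {"acct": "A3", "amount": 40.0, "category": "payment"},
]
account_status = {"A1": "good", "A3": "good"}  # A2 has no matching row

# Stage 1: "left outer join" -> enrich with status (None where no match)
stage1 = [{**t, "status": account_status.get(t["acct"])} for t in transactions]

# Stage 2: enrich with an eligibility flag instead of dropping rows
stage2 = [
    {**t, "eligible": t["status"] == "good" and t["category"] == "purchase"}
    for t in stage1
]

# Earn is computed only where eligible, but the ineligible rows are still
# present with their flags, so we can later explain why each one missed out.
earned = [{**t, "earn": round(t["amount"] * 0.02, 2)}
          for t in stage2 if t["eligible"]]
print(len(stage2), len(earned))  # prints: 3 1
```

All three rows are retained with their status and flag, and only one earns, matching the filtering sketch's result while keeping the audit trail for the other two.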
And that enriching approach is what is successfully running in production today, processing millions of credit card transactions daily, and that's what provides all our customers with millions in cash back and miles. Hope you all got some details about me: I'm a Capital One engineering manager. I have been building software applications since the initial versions of Java, and likewise with Apache Spark. I regularly give presentations on big data and NoSQL, as well as contribute to Capital One tech blogs. I have provided my social handles for you all. Thank you for the opportunity and for watching. Hope you all have a great conference. Thank you.

Gokul Prabagaren

Software Engineering Manager @ CapitalOne



