How HMRC Digital secures services at scale

Abstract

How we secure HMRC’s digital platform (AWS, 1000+ microservices, 100 teams, ~1500 deployments/month)

identifying vulnerabilities
increase buy-in from teams
lean on an opinionated tech stack
service catalogue and async chat comms to power security collaboration

Summary

Ben Conrad is head of product at MDTP, the tax platform at HMRC. The talk covers how Scala services running at HMRC are kept secure, and broadens out to cover things that you can use elsewhere.
HMRC has a multi-channel digital tax platform (MDTP). The platform exists to make building and hosting digital services as easy as possible. And really importantly, nearly all of it is self-service.
MDTP hosts nearly all of HMRC's customer-facing digital services. The platform abstracts AWS services so that developers writing services to run on MDTP do not need any AWS credentials. What we're really trying to do is remove the complexity and in some ways remove a lot of the choices about which technology to use.
We have a platform security team and an application security team. Appsec focuses on the security of the applications that we host. Any sufficiently large system is going to be under attack. It's important to remember that security isn't a goal in its own right.
There are over 1000 microservices on MDTP. This creates a challenge for HMRC, as things are constantly changing. The number of changes is not in itself a security problem. How can HMRC protect itself? Trust, but verify.
A new vulnerability is found and the question is, how do we know whether we are vulnerable to it? One of my favorite tools that we've developed on the platform is called the Dependency Explorer. It allows you to search through all the dependencies of all the services. A separate risk ledger is just a set of spreadsheets that identify areas where risky code could live.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

My name is Ben Conrad and I have the job title of head of product at MDTP, which is the tax platform at HMRC, and we're going to spend a little bit of time today explaining what that is. I'm also joined here by Gerald. Hi, I'm an appsec snooper. Ben's brought me along to the platform to turn over stones, pull on strings and talk in cliches. So you might be thinking that this talk is going to be incredibly niche: how we try and make sure that Scala services running at HMRC are secure. And in some ways it is. We're intending to broaden it out and cover things that you can use elsewhere. So Ben, we're here to talk about securing the multi-channel digital tax platform. To set some context, what is it? The headline is that it is a PaaS, a platform as a service. In an effort to reduce their postage costs and save on the brown envelopes that HMRC likes to send out, HMRC have been building digital services in line with the approach defined by the UK Government Digital Service, which is broadly digital by default. And to make this easier, HMRC have what we call a multi-channel digital tax platform, MDTP, or just the tax platform. The platform exists to make building and hosting digital services as easy as possible. MDTP is a platform as a service, as I say, and it's where the infrastructure, logging, metrics, alerting, CI/CD pipelines, testing, prototyping, templates, everything that you need to build and develop a digital service, is provided out of the box. And really importantly, nearly all of it is self-service. So for me, one of the highlights of working with you over the years was that relatively recently you told me that MDTP had a vision statement. Now normally I associate vision statements with verbal gymnastics to make a company sound like everything to everyone without being offensive to anyone, which then gets used to align people on mandatory fun days. But this one I really liked. Simple, secure services for all.
Let's go away and try to understand what that means. There's so much that you can get out of this simple statement, probably too much to go into here, but what does it mean for you? MDTP has been in existence for the last eight, nine, maybe ten years, and we host nearly all of HMRC's customer-facing digital services. As a platform, we provide this set of infrastructure and tools to allow people, developers, to build, test and deploy services. But they have to be written in a certain way. They have to be written in Scala with the Play framework. And we think we're pretty good at what we offer. The platform is hosted in AWS, but the platform abstracts AWS services so that developers writing services to run on MDTP do not need any AWS credentials. And that's a really important point. They are not writing services to run in AWS or any other cloud provider. They are writing services to run on MDTP. And we could move the whole of MDTP to a different cloud provider. And although that would be a lot of work, the services running on MDTP hopefully wouldn't have to make any changes. And in fact, we have managed to do that in the past. We talk about MDTP being an opinionated platform. The opinions we hold define this paved road, this golden path, the bowling alley of success. And that's what we provide to our users. And the intention is that following the paved road will allow the teams to build services quickly and efficiently. What we're really trying to do is remove the complexity and in some ways remove a lot of the choices about which technology to use. And the payoff for this is that if you follow our opinions and you stay within those guardrails, then you can focus on solving business problems and deliver value to HMRC and to your users really quickly. Now, this talk is about security, the secure services bit of the statement, and specifically appsec. Yes, appsec.
It is, I think, a little bit contentious that we have a platform security team and an application security team, and maybe we'll touch on why that split exists. Application security, as I say, we do differentiate from our platform security team, who focus on the infrastructure. We've always taken responsibility for the securing of the platform itself, the infrastructure, the features that we build. However, I guess it should be clear that if you're just concentrating on the infrastructure, it's only one side of the coin. The platform itself can be hardened and you can hold it to be relatively secure, but that doesn't really count for a lot if the applications that we're hosting are riddled with vulnerabilities that are easy to exploit. I guess you'll forgive the analogy. If you make really thick, strong walls that will withstand all sorts of attacks, it's not actually very useful if the windows and doors are left wide open. It's not going to provide that high level of protection. The services hosted on MDTP have always had, and I hope always will have, responsibility for their own security. However, because of the consistency and the way that all of these services are built using common tools and common technologies, we're able to effectively look for vulnerabilities not just in a single service, but across hundreds of services at a time. And we can also provide tooling that enables teams to proactively check for known vulnerabilities in their own code as part of an automated CI/CD pipeline. It's also important to remember that security isn't a goal in its own right. It's something that always needs to be looked at in context. It's no good throwing lots of security tooling at the problem and then giving yourself a pat on the back and a tick in a box. We process payments of hundreds of billions of pounds a year, and we legitimately pay out many billions, even in years without a global viral pandemic.
The applications on the platform process the data for around 45 million individual UK taxpayers and about 5 million companies. And that data in itself is really valuable, and the UK government has got legal responsibilities to protect it. So the Appsec team are focused on looking at the security of the applications that we host. I guess the question is: who are we worried about? Who are the threat actors? And in some ways this isn't that important, in that a number of different threat actors may actually be looking to exploit the same vulnerabilities. But it's always useful to sort of know your enemy, I suppose. So when we are doing a risk assessment, who are the threat actors that we're looking at? Rogue engineers is something that we know we could have. It's worth mentioning. It's not that we don't trust our engineers, but it's worth remembering that people can have their credentials stolen, they could be blackmailed, they could be exploited. Script kiddies. So any sufficiently large system is going to be under attack. And we're not a WordPress site, but we get lots and lots and lots of requests which seem to believe that maybe we are, because people are just trying anything. It's so cheap to do that. Fraudsters, the garden-variety fraudsters. It's important, I think, that security isn't a technical thing, or not solely technical. You're also thinking about how an information system can be abused to trick people out of money. So there's been a spate at the moment of people receiving text messages or communications, which then convince them to hand over their details so that fraudsters can claim tax free payments on their behalf, and often without the victim realizing that this has been a problem so far. Hackers. So it's important to remember that very often attackers will go for the low-hanging fruit. It's not so much important that you have an unbreakable lock on your door.
But you do need to remember to lock it, and you need to make sure that you are taking advantage of the tools that you do have to secure your systems. And finally, nation states. I think this is the one that's most difficult, and it may seem overkill to think that a nation state is going to attack us, but actually we are a UK government organization and we can't ignore the possibility. So, onto a bit more about the platform. We have a microservice architecture with a lot of services. There are over 1000 microservices. The numbers fluctuate a bit. I think there have been around 200 new microservices created on MDTP so far this year, but not all of those will be running in production yet. And sometimes we get to decommission old services if they get replaced or are no longer needed. How you count teams is quite difficult, because quite often there is a one-to-one relationship between a team and a single service. But we also have live service teams who may look after 50 or so different services, and there are plenty that fall between those two extremes. And of course, the teams vary in size as well. So in total, it depends how you count them. I think there are about 340 frontend microservices on the platform. So there are a large number of digital services, which I think really speaks to how inventive this country is at coming up with new taxes. The point is that we're operating at significant scale and with quite a lot of changes to code. On this chart you can clearly see Christmas and, to a lesser extent, Easter. And I think the last one is the late Queen's Jubilee, which was a lot of fun, but no good for productivity. And each of these lines of code could be built into a new artifact, and then those will be tested through our pipelines. If a test fails, then the pipeline will fail and the artifact won't progress any further.
This does create a challenge for HMRC, though, because things are constantly changing on the platform, and we want to know that we're not introducing security holes with those changes. But just before I move on, just to be really clear, the number of changes is not in itself a security problem. Indeed, it's very, very much the opposite. If we implemented a change freeze and set the whole platform in aspic for the next year, we would become far more vulnerable to security incidents, not less. A portion of these deployments will be to upgrade code to remove older versions with known security risks. But all of these changes will be improvements to services that HMRC make available. And the higher the numbers, the better. I think that's quite enough context, and I hope I've not bored you all too much. So the question we've got remaining is how can we protect ourselves? Trust, but verify. I used to love using the Russian proverb, doveryai, no proveryai, but I guess that's no longer cool. So we've got a number of different problems. Let's start with one, which is that we have all of these different microservices and they all have dependencies. And so that means that we have lots of dependencies, although not quite as many as there might be, because we have opinions and because we make sure that everything is written in Scala with the Play framework. But we may well have unsupported or vulnerable code running across our many services. So what can we do about that? So the first step in doing that is to introduce something called Bobby rules. Bobby is a tool that we've written that is used as part of the builds, and it fails the build if there are any dependencies that we don't like. So we can manually say don't use that. If you use it, you can't build. It's quite a severe tool, which is why we tend to be quite careful in using it. We tend to announce to people that things will be deprecated and give them some time to update things.
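As an illustration of the idea behind Bobby rules, here is a minimal Python sketch of a build check that warns before a deprecation date and fails the build after it. The talk does not describe Bobby's actual rule format, so the `BobbyRule` fields, the function names and the `org.example` coordinates are all hypothetical, and the version parsing assumes purely numeric dotted versions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BobbyRule:
    """Hypothetical rule shape: versions strictly below `banned_below` are not allowed."""
    group: str
    artefact: str
    banned_below: tuple   # e.g. (2, 0, 0)
    reason: str
    effective_from: date  # before this date the rule only warns


def parse_version(text):
    # Assumption: versions are numeric and dot-separated, e.g. "1.4.2".
    return tuple(int(part) for part in text.split("."))


def violations(dependencies, rules, today):
    """Return (status, coordinates, reason) for every dependency that breaks a rule."""
    found = []
    for group, artefact, version in dependencies:
        for rule in rules:
            if (rule.group, rule.artefact) != (group, artefact):
                continue
            if parse_version(version) >= rule.banned_below:
                continue
            # Announced but not yet enforced rules give teams time to upgrade.
            status = "fail" if today >= rule.effective_from else "warn"
            found.append((status, f"{group}:{artefact}:{version}", rule.reason))
    return found


def build_passes(dependencies, rules, today):
    """The build is only blocked once a rule's enforcement date has passed."""
    return all(status != "fail" for status, _, _ in violations(dependencies, rules, today))
```

With a rule that takes effect on 1 January 2023, a service still on the old version would get a warning in December 2022 and a failed build from January onwards, which mirrors the "announce, then enforce" approach described in the talk.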
If we were to just say one day, oh, you can't use this, then we'd probably get inundated with calls saying oh no, we need an exemption because we have got special circumstances. It's a great tool as well, not only for preventing vulnerable libraries, it's also great for enforcing platform upgrades, and we get quite a lot of reporting on it. We can see trends as to whether people actually look at it. One thing that we have done is we made it so that it can be bypassed. So if somebody needs to have a security fix for something unrelated, then they have to go and ask Ben whether they can do it. And the screen here is a screenshot from the catalogue, so actually it's worth talking about the catalogue briefly. It's an internal tool that we've developed in house, but it's now possible to use more generic alternatives off the shelf. As a tool, for us, it's something of a Swiss Army knife. It holds a vast trove of information about the applications. Nearly all of that is automatically generated. So there aren't manual updates required to anything here. And we use it to basically keep an eye on the services, to make sure that they're all doing the right thing, and we use it to collaborate with the teams, to facilitate that they can upgrade things with the least friction. So here's a second problem that we've got. We are big fans of coding in the open. It's really important. It's one of the GDS standards. We believe it makes things better. However, you do not want to be leaking secrets onto the Internet. And that is something that we know has happened before. So again, how can we stop ourselves doing that? So what we've come up with is what we termed a leak detection service. And essentially it keeps tabs on the GitHub commits that are being done. And when it finds something that looks sensitive, it will alert the teams via Slack alerts, but it'll also alert the security teams, which means that we can look at the bigger picture.
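The leak detection idea can be sketched in a few lines of Python: scan the files in a commit for things that look like secrets, then notify both the owning team and the security team. This is only a sketch under stated assumptions: the patterns, the function names and the channel names are hypothetical and are not taken from the HMRC service, and a real scanner would need many more rules plus a way to handle false positives.

```python
import re

# Illustrative patterns only; not the rules used by the real service.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*[=:]\s*[\"'][^\"']{8,}[\"']"),
}


def scan_commit(repo, files):
    """Scan a commit, given as {path: file content}, and return a list of findings."""
    findings = []
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for rule, pattern in PATTERNS.items():
                if pattern.search(line):
                    findings.append({"repo": repo, "path": path, "line": line_no, "rule": rule})
    return findings


def alerts_for(findings, owner_channels, security_channel="#appsec-alerts"):
    """Build (channel, message) pairs: one for the owning team, one for security."""
    alerts = []
    for finding in findings:
        # The message points at the location but never repeats the secret itself.
        message = (f"Possible {finding['rule']} in "
                   f"{finding['repo']}/{finding['path']} line {finding['line']}")
        team_channel = owner_channels.get(finding["repo"])
        if team_channel:
            alerts.append((team_channel, message))
        alerts.append((security_channel, message))
    return alerts
```

Sending every finding to both channels reflects the point made in the talk: the service team fixes its own leak, while the security team sees the pattern across all repositories.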
Again, it's important that what we're doing here is to collaborate with the service teams themselves, to help them to protect themselves. So I guess here's another problem. Again, going back to vulnerabilities in dependencies, maybe, but maybe something a bit different. A new vulnerability is found and the question is, how do we know whether we are vulnerable to it? So this is one of my favorite tools that we've developed on the platform, which is called the Dependency Explorer. It allows you to search through all the dependencies of all the services. And this screenshot is why the Log4Shell vulnerability that took place at Christmas 2021 was scary for only about ten minutes. We got the notification that there's a problem in the log4j-core library. We had a look and we found that it wasn't used. For anyone not aware, the Log4j vulnerability was quite scary because, with specially crafted messages being logged, it could trigger information leakage and remote code execution. The Dependency Explorer showed us that our services didn't have the dependency, so that was great, but it requires you to know what you're looking for. So this is one of our newer tools, which I know Gerald wants to talk about. The problem we've got, I suppose, is that we do have an awful lot of code with a lot of dependencies and we've got vulnerabilities in some of them, but how do we know what we're vulnerable to and what we can safely ignore? So typically what happens here is that you get some tooling in and do some dependency analysis. In this case, we've actually got JFrog's Xray. The problem we found with JFrog was that the Xray resource screens were horrible. If you look at the screen there, you can't really make out what's being said here. The columns are too small and we don't all have a bank of 42 inch monitors to be able to look at these reports. And looking through 140,000 pages of reports, it's just impossible.
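The kind of lookup the Dependency Explorer made possible during Log4Shell can be sketched as an inverted index from each dependency to the services that use it. This is a Python sketch, not the real tool: the function names and service names are hypothetical, and it assumes each service's dependency list has already been resolved to include transitive dependencies.

```python
from collections import defaultdict


def build_index(service_dependencies):
    """Invert {service: [(group, artefact, version), ...]} into a per-dependency index."""
    index = defaultdict(list)
    for service, dependencies in service_dependencies.items():
        for group, artefact, version in dependencies:
            index[(group, artefact)].append((service, version))
    return index


def who_uses(index, group, artefact, version_predicate=lambda version: True):
    """List (service, version) pairs using a dependency, optionally filtered by version."""
    users = index.get((group, artefact), [])
    return sorted((service, version) for service, version in users if version_predicate(version))
```

Asking the index for `org.apache.logging.log4j:log4j-core` and getting back an empty list is the "scary for only about ten minutes" answer from the talk. As the speakers note, this only helps once you know which dependency to look for.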
But the information that's contained in Xray is obviously useful. But there is a problem with that as well, isn't there? Yes. So every vulnerability tends to have a CVSS score, which stands for Common Vulnerability Scoring System. And very often, a lot of tools use that score as a risk score. So you can set up policies that say if there's a risk of higher than eight, then don't allow it. If it's less than eight, it's not a problem. The problem with that approach is that it's not a risk score. Now we've actually gone through each of the CVEs that were flagged and we found that some of the worst issues did not have the worst scores, and some of the worst scores that we found were not an issue at all. It always depends on context. And if there's one thing you take away from this talk, please don't go away and define policies that say anything less than eight can go through, because it isn't secure. What we did do, as a first step, is we evaluated all the dependencies and looked at the vulnerabilities that were in them and then aggregated them. And we actually used spreadsheets. So instead of 100,000 reports, we're looking at the individual CVEs. And then based on that, we looked at creating a prototype and then we created tickets to sort of say, well, how can we turn this into something that can be consumed by each of the services? We looked at the fact that we've got a huge number of things to process and we tried to be basically agile and created an MVP and then took it from there. The start was just this very simple three by three board where we laid out what are the things that will be needed by different parts of the organization. And then in less than a month, we had an MVP. That allowed us to look at the problems, but more importantly allowed the service teams themselves to check what the vulnerability issues were.
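The aggregation step described above, going from one report per artefact to one row per CVE with a human assessment attached, can be sketched in Python as follows. The field names, verdict labels and placeholder CVE identifiers are hypothetical. The point of the sketch is that rows are ordered by the assessed verdict and the number of affected services, not by CVSS score.

```python
from collections import defaultdict


def aggregate_by_cve(findings):
    """Collapse per-service findings into one entry per CVE."""
    grouped = defaultdict(lambda: {"cvss": None, "services": set()})
    for finding in findings:
        entry = grouped[finding["cve"]]
        entry["cvss"] = finding["cvss"]
        entry["services"].add(finding["service"])
    return grouped


# Hypothetical verdict labels, most urgent first.
VERDICT_ORDER = {"action required": 0, "needs assessment": 1, "no action": 2}


def triage(grouped, assessments):
    """Return (cve, cvss, affected service count, verdict) rows, most urgent first."""
    rows = []
    for cve, entry in grouped.items():
        # The assessment carries the context; the CVSS score is shown, not sorted on.
        verdict = assessments.get(cve, "needs assessment")
        rows.append((cve, entry["cvss"], len(entry["services"]), verdict))
    return sorted(rows, key=lambda row: (VERDICT_ORDER.get(row[3], 1), -row[2], row[0]))
```

In this arrangement a CVE scored 9.8 that has been assessed as "no action" in context sorts below a CVE scored 5.3 that is actually exploitable, which is the opposite of what a fixed "block anything above eight" policy would do.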
And we provided assessments by the Appsec team so that we didn't overload the service teams by saying, now you've got hundreds and hundreds of different reports to look at. And I think it looks better than Xray already. Okay, so it's all well and good finding problems in somebody else's code, but how can we find examples of maybe bad code in our own? Yeah, it's an interesting thing. And again, you can start really simple here. I mean, we've created what we've termed the risk ledger. It's just a set of spreadsheets that identify areas where risky code could live. As an example, we know that sometimes parsing XML can be problematic. So we've identified all the places where XML parsing is happening. We then use that to create a sort of risk ledger to say, okay, these are all the places that we want to check, which then allows us to find the different patterns of usage that happen. Now again, you have to remember that this is an issue of scale. It's possible for somebody to remember this for 20 microservices but not for 1000. And then what we're doing is we're taking this risk ledger approach and now we're starting to build tooling around it. Yeah, I think we've recognized for some time that our success, our scale, has created a problem, although it's also created opportunities. Our platform security team were our first attempt to really make security a first-class citizen of MDTP. We wanted to start looking at the security of those things that we had direct ownership of. And in a way that was the easy part. Although I think we can agree that, like love, security is a journey, not a destination. With application security, there are other challenges. Services on the platform are regularly reviewed from a security perspective, but not as often as we're making changes to them. As the platform, we decided that we could do more, and that's where the idea of a platform-based application security team came from.
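The risk ledger described above starts as a spreadsheet, and a first pass at generating one can be sketched in Python: search source files for patterns that mark potentially risky code, such as XML parsing, and write one row per hit. The pattern list and function names are hypothetical choices for illustration; the talk does not say how the real ledger was compiled.

```python
import csv
import io
import re

# Illustrative markers for code worth a closer look; not an exhaustive list.
RISK_PATTERNS = {
    "xml-parsing": re.compile(r"\b(?:XML\.loadString|SAXParserFactory|DocumentBuilderFactory)\b"),
    "file-write": re.compile(r"\b(?:FileOutputStream|Files\.write)\b"),
}


def build_ledger(sources):
    """Scan {(service, path): source text} and return one ledger row per match."""
    rows = []
    for (service, path), text in sorted(sources.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            for risk, pattern in RISK_PATTERNS.items():
                if pattern.search(line):
                    rows.append({"service": service, "risk": risk, "path": path, "line": line_no})
    return rows


def to_csv(rows):
    """Render the ledger as CSV text so it can be opened as a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["service", "risk", "path", "line"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A hit in the ledger is not a vulnerability, only a place to check. Its value is that nobody has to remember where XML is parsed across a thousand services.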
The first remit is really to go and lift some rocks, pull on some threads and see what the problems are, as Gerald mentioned at the start, but then secondly, investigating what we can do to fix those problems, and preferably at a platform level, at that scale level, so that we can protect all the services running on MDTP and not require individual changes across thousands of repositories. So I'd like to borrow here from Team Topologies: the service teams are the stream-aligned teams, and Appsec can be considered as an enabling team. Now if you look at this slide, my first attempt looked quite different. It was sort of like a hub and spoke with Appsec sat at the center, but I felt it completely gave off the wrong message. Service teams aren't at the margins when it comes to talking about security. Security issues are always about context, and it's the service teams themselves that will have the context. So I've tried to sort of visualize this here with a sort of double robbery to say that Appsec can't function without the service teams and the service teams can't do all the security themselves. Right at the beginning I said that my title is an appsec snooper. It wouldn't be possible for me to review the security of services at the scale of MDTP if we had 15 different languages. It wouldn't be possible for Appsec to do something centrally if everyone did something different. And so that's where it becomes possible for a central Appsec team to do some of the turning over of stones that otherwise wouldn't be possible, that service teams themselves wouldn't necessarily have the time for. Now, when Appsec finds something, we get in touch with the owning team and we just talk to them. It's not about blame, it's not about finger pointing, it's about collaboration to make services more secure. And the catalogue always allows us to find a Slack channel that we can go and talk to. And the whole thing works both ways.
When a service team finds an issue, they can feed it back to the central Appsec team, they can ask about best practices. And this is a great example as to how security works. It's about collaboration, it's about people, it's not about tools. But for the collaboration to work, we do need those tools. So this goes back to an important point that I think we've been trying to make all the way through, which is the paved road. The opinionated platform enables a certain amount of centralization when it comes to security, but those are the same benefits as for developing services quickly. We know that we can allow or enable services to be built really quickly. We built some for Covid in about four weeks, but the platform supports both that speed and also security. So just on the conclusions here, the paved road is really important. It makes centralizing application security possible, and the tooling is really important. It allows the centralized teams to reach out to the service teams, it allows these findings to be distributed in a self-service manner. And we're not just chasing people down. Also, it really helps if you've got experienced engineers who know what they're doing. Yes, securing a complex system is very hard, but you don't have to do everything at once. And if you want to start with creating an appsec team, I would personally recommend just starting by collecting that sort of threat intelligence. Find out which service writes files or which talks to a particular sensitive back end. Make a list, use a spreadsheet, aggregate it, script it, automate it, scale it, be agile about it. Thank you very much for coming to our talk. Thank you. Take care.

Slides

Download slides (PDF)

Conf42 DevSecOps 2022 - Online

December 01 2022

Ben Conrad

Head of Product @ HMRC

Gerald Benischke

Equal Experts, AppSec Lead @ HMRC
