Conf42 DevOps 2024 - Online

Who Secures Our Code When an Army of Robots Is Writing It?


Abstract

LLMs are helping developers write more code with the same vulnerabilities. Security is already hopelessly outnumbered, and we're barreling towards a future with no oversight. This talk will explore the numbers, a vision of what we're missing, and some open-source tools to jumpstart our journey there.

Summary

  • Arshan Dabirsiaghi: Let's talk about generative AI and how it's going to affect the security of our software. What most people are using, what's been adopted as of now, has been the autocomplete feature. With higher throughput will come downstream consequences, some good, some bad.
  • The models produce insecure code, and the statistics on the right help us understand that. Code bases are just way too big. Can't the models just generate secure code? Can we teach them to do that?
  • A diagram shows the secure development lifecycle and where security fits in. Every time you commit some code, your pipeline runs. The process for acting on these results is very manual. Now imagine we have all the generative AI: what are the robots going to do when they find a static analysis finding?
  • Some studies say that developers outnumber security 100 to 1. The humans we have aren't cross-skilled. Developers don't have great security skills, and security is a tough, complex, fast-moving field. What are the things that can scale with the robots?
  • Another strategy is to make it hard to exploit your insecure code. RASP tools inject sensors and actuators into the application itself. These tools can help with many things, but there are still things they can't help with.
  • Codemodder is an open source codemod library for Java and Python today. The idea is that a codemod is a little bit of code that changes a lot of code. What we're trying to do with this open source project is create 50 or 100 rules, or codemods, to upgrade code.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, and welcome to my talk: who's going to secure the code an army of robots is going to be writing? So, my name is Arshan Dabirsiaghi. I've been at the intersection of code and security my whole career, around 20 years. And all of these bullets can be summarized as: I've always been fascinated with the idea of taking over your computer against your will, and all the different aspects that come out of that, including protecting your data. And so let's talk about generative AI and how, from my vantage point, I see it affecting the security of our software. So if you look at the studies that have been put out by the major players, you would be led to believe that the tools we have today for helping us code, which largely take the form of these fill-in-the-middle models like GitHub Copilot, help us produce 25 to 60% more coding throughput. Now, these numbers really depend on how you measure and what the activity is. I think a lot of people say that the acceptance rate is between 20% and 30% for GitHub Copilot's suggestions. I haven't had a chance to play with the other tools, but these fill-in-the-middle models, really, you can think of them as autocomplete, right? And just with autocomplete, we have to acknowledge that throughput is higher. And now we can argue about the numbers and what they mean, but with higher throughput will come downstream consequences, some good, some bad. But right now, what most people are using, what's been adopted as of now, has been this autocomplete kind of feature. And more recently, we've seen coding assistants jump into our IDE, things like Magic and GitHub Copilot in the IDE. And so these things are pretty new, and adoption is not through the roof yet. And what we saw on the autocomplete side was this, I don't want to say modest, improvement to throughput, but one standard deviation away from what we do today. Now, if we have an assistant that is drafting whole sections of code, whole files of code, you can see in this example we're asking Copilot to create a new button component, and it's able to do that. And so if a fill-in-the-middle model can deliver us 25 to 60% more throughput, we have to ask ourselves, what is the throughput that a drafting assistant like this could deliver? And the studies haven't been done on this yet. The studies really are just coming out now for the fill-in-the-middle models, so we're left to guess here. I'm guessing sort of out of thin air here, 100%, just for argument's sake. And if you watched the GitHub keynote just a few weeks ago, you would have heard that GitHub has said that they're refounding the company based on Copilot, and they demoed something at the end, with their "just one more thing," that was really impressive. And so they showed essentially somebody opening up an issue and saying, hey, I want to add this feature. And the GitHub Copilot feature that they were demoing created a plan, as you can see here, went through the code and changed files, and will try to auto-fix things like compilation errors. Or, from the demo, it seemed like it was going to try to solve small kinds of errors on its own. And so now this is a tectonic shift from fill-in-the-middle, right? This is taking requirements, sort of guessing at the requirements, guessing at how those requirements translate into code requirements, performing multiple file changes, and then iterating on those changes to get them to a working state. And so if autocomplete can give us these modest numbers, what can a feature like this do? 
We're again left to speculate. It seems like a lot. So there have also been studies on the models to help establish that they're not that good at security; in fact, they're kind of poor. The models produce insecure code, right? The statistics on the right help us understand that. They've done studies where they had a control group that didn't use a coding assistant and a group that did use an assistant, and they found that, consistently, the group that used the assistant produced more insecure code and, perhaps more dangerously, believed that their code was more secure than the control group did. This also reflects my experience testing the top commercial offerings. If you ask the models about SQL injection, they're fairly competent. SQL injection is a super common issue, and there's a ton of literature out on the Internet about it. But if you ask about a vulnerability class that's just slightly less popular, I have not found them to be competent at all at delivering fixes or at analyzing whether the code is secure. They can't reason about these issues with any level of competency. So our experience is that, basically, the LLMs don't produce secure code. Now we have this issue where, and maybe this shouldn't be surprising, the LLMs are trained on human code, which contains bugs. And so, in fact, they're overtrained, right? I mean, the data set they're trained on is full of insecure code, because that's the code that's in GitHub, that's publicly available. And so it's not a surprise that the LLM then produces insecure code. So if you wanted the why, or the how, of them producing insecure code: the models are really a mirror held up to the existing, vulnerable human code. So then the question was posed to me, can't the models just generate secure code? Can we teach them to do that? Maybe. And this is harder than people think. To understand why, you need to understand the nature of a vulnerability. So if we look at a vulnerability taking place, all these blue bubbles are little pieces of code. User input comes into the system here and then it bounces all around the system. And the system is not just what's in your GitHub repo, right? The system, the application, is a combination of your custom code, libraries, frameworks, the runtime, third party services, et cetera. And so as the untrusted data flows around your system, you'll see that eventually, in this example, it reaches a place in the runtime where it shouldn't. Lots of vulnerabilities can be modeled this way: SQL injection, cross-site scripting. Many of the medium, high, and critical vulnerabilities look this way. So it's tempting to just say, well, we'll take the whole code base and shove it into the context window and try to reason about the safety, the security, of the code. But you might notice that a lot of this vulnerable data flow isn't even in your code. So that's one problem. And then also, code bases are just way too big. The most popular, biggest context window today, and I'm not even sure if it's available as a public offering yet, is a 100K-token context window from Anthropic, if I'm remembering that correctly. But 100K tokens is not going to be enough. It might be able to fit a microservice with one endpoint in it, but when you look at the manifest files, the data files, all the code, we're going to need context windows in the millions of tokens to fit most apps. 
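To make that kind of flow concrete, here is a deliberately simplified sketch, with hypothetical function names, of untrusted input hopping through a few helpers before it reaches a SQL sink. In a real application these hops are spread across files, frameworks, and libraries, which is why neither single-file review nor a single context window tends to catch the issue.

```python
import sqlite3

def read_request_param(raw_query: dict) -> str:
    # Hop 1: untrusted input enters the system at the edge.
    return raw_query.get("username", "")

def build_lookup(username: str) -> str:
    # Hop 2: far from the edge, the value is concatenated into SQL.
    return f"SELECT id FROM users WHERE name = '{username}'"  # vulnerable

def run_lookup(conn: sqlite3.Connection, raw_query: dict):
    # Hop 3: the sink. A value like "x' OR '1'='1" arrives at the database as code.
    return conn.execute(build_lookup(read_request_param(raw_query))).fetchall()

def run_lookup_safe(conn: sqlite3.Connection, raw_query: dict):
    # The fix keeps data and code separate with a parameterized query.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?",
        (read_request_param(raw_query),),
    ).fetchall()
```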
So the next, maybe most obvious, step would be to take all the code and cram it into embeddings, which are another way we augment LLM usage in order to give it knowledge about other things that are big. I know that's a very simple way of thinking about it, but, for instance, you might feed an LLM all of your docs in order to make a useful chat bot. You would give it the ability to sort of search over that space, and that works relatively well. But searching is different from reasoning. And so if we were just to cram all of the code into embeddings and then ask it to connect these dots for us, that hasn't worked in my experience, and we haven't seen anything from the wider marketplace suggesting that it is possible yet, either. And generally, models are confused by very long series of events where you have to reason across many different steps. Most of these issues involve many steps and sometimes many variables, and adding those two dimensions, LLMs tend to deliver less, in my experience, on those types of problems, where they have to keep track of the history of two variables across all these different events. It's quite difficult. If we wanted to solve this problem, I think we have to figure out how to make these problems smaller for the LLM. And just as a point of reference, we've tried to build static analysis tools, code analysis tools, that are purpose built for exactly this problem, and they can't do this fast and accurately. And so the hope that a very general purpose model, working at the speed of inference, can get this right, when we can't even get it right with really purpose-built tools, feels very far fetched to me. So I want to show you this diagram, put out by Pedro Tutti, about what the secure development lifecycle is and where security fits in. What you see here is a company that takes security very seriously. Whoever actually worked on this diagram, they have a range of activities: things you do at the front end, things you do once to establish the context of that app for its lifetime, and things that you do continuously, either during development, during testing, and then in production as well. And they've labeled some of the things that are manually performed here with this little M. And I'm going to tell you that this diagram is way underlabeled; it doesn't have nearly enough M's. I'll give you an example. So, in the beginning, yes, you would do threat modeling once. Threat modeling is an exercise where you sort of look at the inputs and outputs of the system, you look at the third party systems it connects to, and you try to predict ahead of time what are the different threats you might face, what are the controls you want to make sure you have. This is an expert-led human process, usually looking at a combination of cloud infrastructure and literally papers to try to do this process. So this is obviously very human, very manual. You're asking a ton of questions when you lead a threat model, to try to enumerate the real-world picture of this app. And so, of course, this is going to happen one time; it's very unlikely to happen a lot after that. But every time you commit some code, your pipeline runs. You're going to run a static analysis tool. You're going to need to look at the results of that when something is found, right? And so now imagine we have all the generative AI, all the robots are working, they're cranking out code. 
What are the robots going to do when they find a static analysis finding? Well, let's talk about what the humans do. But first, we're going to put M's here, because now we've said, look, when we do static analysis scanning of the code and there's a finding, we need to do something about it: we need to triage it, we need to possibly fix it, we need to do something. And so we put M's on all of these activities: static analysis; software composition analysis, which is looking at the libraries you have, especially new libraries coming in; dynamic analysis, which is kind of like fuzzing your web application or your REST API from the outside; IAST, which is watching the internals of the system while it's running; scanning any containers that you would have built for vulnerabilities, and sort of the infrastructure of your app. If you're lucky enough to work in an organization that has penetration testing, you're also going to have pen testers looking at your app occasionally. And so hopefully you can see the through line here: we have a lot of security processes and a lot of security technology, and the process for acting on these results is very manual. And so we list some of the human interventions here. It's interesting to note that these activities are not just strictly a developer's. There's a lot of developer work happening here, but there's also some product management work, talking about trade-offs, there's compliance work, there's security engineering. There are a lot of activities here, across a couple of different disciplines, needed to make this secure software factory work. Now, again, what are we going to do when the robots come? Because the question is going to be, are we going to slow down the software factory in order to accommodate inserting humans in all these places? And typically, when somebody says go fast or go secure, businesses choose to go fast, because they have to compete. They feel like they can't tie one hand behind their backs, for a lot of different reasons. So today I want to talk about how our programs are limited. I think developers often think that security is somebody else's job, and to a certain degree that is true, but I just want to give you a glimpse behind the wall here. Some studies say that developers outnumber security 100 to 1. My experience is that if you go to a giant bank or a giant financial institution, these ratios are much worse. I wouldn't want to speculate on what the numbers are, but they're at least maybe an order of magnitude off of this. And the humans we have aren't cross-skilled. If you think of it very simply, and we just say there are developers on one side and security on the other: security understands risk pretty well, in a way that the developers don't. To be honest, security people think about security differently, but they often don't have the skills to pitch in directly or to review findings very deeply. They just don't have that skill set; a lot of times they don't come from an engineering background. And then on the other side, developers don't have great security skills. They don't understand vulnerability classes that well, and we shouldn't expect them to, because security is a tough, complex, fast-moving field where every vulnerability class is its own interesting rabbit hole. So they don't have the muscle memory to do a really great job of working through vulnerabilities on their own. 
And that's why we'll often see developers struggle to fix vulnerabilities after one iteration. If you read bug bounty reports, you'll often see the developer fix something, but the attacker is able to get around their proposed fix right away. So, anyway, we have people who are good at parts of this, but we don't have that many people who are really great at both. And then, just from a math perspective on the number of humans we have, regardless of how skilled or cross-skilled they are, we just don't have enough people to do the jobs. So what ends up happening? If this is our application portfolio, and you're the business owner at a big bank or a big technology company, and you say, these are all the apps I have, what typically happens is that you label a very small number of those apps, let's say 10%, as the most critical. These are things that might be Internet facing; they might directly touch some sensitive assets. And so you might choose to say, look, we're only going to run the full barrage of activities and tools on this 10% of our most critical applications. And this is why we have situations like the major retailer that was broken into a few years ago and suffered a tremendous, really painful breach. The way the attackers got in was through a contractor HVAC portal. And I'm sure that at some point, somebody looked at this asset of the company's and said, it's just for contractors, it's just HVAC, there just aren't enough assets at risk here, it's a small number of people who have access to it, HVAC contractors. But to an attacker looking to get a foot in the door, it all looks the same. And so the attacker in this case breached this system, I'm not sure if they had insider information or knew a contractor, but they found some way to get to it, and they pivoted from there to the soft underbelly of the systems and did a lot of damage. And so, if we're only doing what we want to do on, let's say, 5% to 10% of the applications that we have, and we're about to raise the throughput of our developers by a lot, somebody's going to have to explain to me how we're going to secure all this code. This caused a lot of soul searching for me, and I'm sure for a lot of other people, to ask: what are the things that can scale with the robots? And I'm choosing just three things to talk about here today. I think there are some more opportunities, but I want to focus on the highest yield things. So one solution, one strategy we can have, is to make it hard to be insecure. Netflix has a term for this: they call it paved roads. This is the idea where we have a use case, and we give the developer a very simple path to follow. We give them a framework, we give them an abstract type to work with that automatically enforces authentication for them, right? So they don't have to think about authentication anymore. They don't have to think about identity and authentication. They just add the type, add their feature, and security comes baked in. In the same vein, you might have another use case where, for a developer's code to even compile, it forces them to provide roles for access control enforcement. And so this is really forcing the developer to acknowledge, when they make a new feature, what are the security aspects they should be asking themselves about; a small sketch of what such an abstraction might look like follows below.
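As a rough illustration of that idea, here is a minimal sketch of a hypothetical "paved road" base type. The framework and names here are made up for this example, not any real library's API, but the point is that authentication comes baked in and the class refuses to exist without an access-control decision from the developer.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class User:
    name: str
    roles: set

class SecureHandler(ABC):
    # Every feature handler must declare who may call it; forgetting to
    # think about access control becomes a "class won't instantiate"
    # problem instead of a production vulnerability.
    @property
    @abstractmethod
    def required_roles(self) -> set:
        ...

    def handle(self, request: dict):
        user = self.authenticate(request)                  # identity is baked in
        if not set(self.required_roles) & set(user.roles):
            raise PermissionError("caller lacks a required role")
        return self.run(user, request)                     # only then run feature code

    def authenticate(self, request: dict) -> User:
        # Stub: in a real paved road, the platform team wires the
        # organization's identity provider in here, once, for everyone.
        return User(name=request.get("user", "anonymous"),
                    roles=set(request.get("roles", [])))

    @abstractmethod
    def run(self, user: User, request: dict):
        """Feature code goes here; the security checks already happened."""

class ExportReportHandler(SecureHandler):
    required_roles = {"report_admin"}                      # forced security decision

    def run(self, user, request):
        return {"report": "ok", "for": user.name}

print(ExportReportHandler().handle({"user": "mary", "roles": ["report_admin"]}))
```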
And we don't do this enough. Usually we tell the developer, hey, we need you to go make an app that does XYZ. And the product manager, the product owner, many times doesn't really care about the security of it. They just sort of assume that security is baked in. And so the developer might stand up a new app with just the base, let's say, Express framework, and that's not going to come with any paved roads, right? The developer is now going to have to reinvent all the security controls, and they're probably going to get a lot of things wrong along the way. So these are some good ideas just to force them down a road that's going to either provide the security or force them to provide answers themselves. You might say, to prevent cross-site scripting, all the apps here have to be REST plus JSON, right? And that makes cross-site scripting patterns a lot harder to create accidentally. And this doesn't just apply to code or frameworks or something at compile time: maybe we say, look, there's only one pipeline to use, or if you want to run a GitHub Action, we're automatically going to add this in there, where we're going to force static analysis on every build, or we're going to add a GitHub app that watches your dependencies. These paved roads really help. And if a robot is going to take your code and add something to it, Copilot, in my experience, has been pretty good about following the patterns that are there. So if all it sees around it are paved roads, it'll probably use the paved roads, and that will force it to reason about some of these things that we talked about and provide some good first-draft settings for them. So what does it take to get this done, organizationally? The first thing is we need strong DevEx and platform teams: strong, centralized teams who understand developers, who understand the requirements from security, and who can help developers go down these paved roads. Now, the second bullet here is also really important. If you want to have paved roads, unfortunately, you can't say go build whatever you want in whatever language you want with whatever framework you want. That ends up being really difficult, because if you want to build a paved road, like an access control mechanism, for every different language and every different framework, it just doesn't scale. So the fewer technology stacks you have, the better. And of course, if Express is really big in your organization and you have a long tail of other technologies, it's still worth it to do paved roads for the technologies that are predominant. But it's hard to solve globally unless you're a little bit more authoritarian about which technology stacks are allowed. And then, we love the developer security champion model, which has become a really popular model in lots of different companies: security cross-skilled developers who can chime in, who maybe think about risk a little bit differently than your average developer, and who can help create these paved roads and help inject security into them. I added a few vendors and tools at the bottom for you to look into more, if this is interesting for you. So, one other strategy. The first strategy was to make it hard to be insecure, right? Give developers paved roads so that it's difficult to get off those roads and create accidental vulnerabilities. Another strategy is to make it hard to exploit your insecure code. 
So traditionally, runtime protection was dominated by tools called web application firewalls, which watched HTTP traffic and tried to detect attacks with signatures. They weren't super accurate, but they did provide visibility into traffic, and you could detect obvious attacks. It was hard to rely on them in a blocking mode because they had lots of false positives; it's very difficult to watch traffic and pick out the bad stuff without accidentally picking out good stuff. I've been in a job like that, and it's quite difficult. And so in the last few years, we've seen a class of tools, I worked on one, called RASP: runtime application self-protection. Whereas traditional protection tools sat at the network level and tried to build a moat around the app, these RASP tools are injecting sensors and actuators into the application itself, into the app, the frameworks, the libraries. So these are language-level agents that put those sensors and actuators in and act on behaviors rather than traffic signatures. And so this is an example of a RASP tool. The attacker sent in some user input; they're trying to exploit a SQL injection. They send in this tick, OR 1=1, followed by what is a comment in this SQL dialect. And they're trying to attack this line of C# code where the user input is included. Now, a WAF only sees the input, right? It gets the traffic first, it looks at the input, and it has to say, is this an attack or not? It has to make a decision far too early, far too far away from what we call boom, to make that decision well. But a RASP can look at the application behavior. It can look at the SQL query that's actually being sent, scan it, tokenize it, and semantically analyze it, and say, look, some user input came in, I saw it go into this SQL statement, and two things about it irritated me. One is that it caused data in a data context to become code; the token boundary was crossed here, and this input looks like it became code at this OR 1=1 part. And then there's a clause that always evaluates to true, right? That's something we can evaluate at the sensor where the SQL is evaluated, and that kind of bugs us too. So you have so many more degrees of freedom with tools like these that sit in the runtime and can look for malicious behaviors. People who used these tools were protected from the Log4j exploits when the Log4j exploit came out, because they were watching for malicious behaviors of the runtime, not traffic. There are some vendors here; there's not really a good open source option yet, but it's still a relatively new space. And so if you want to make your code much harder to exploit, this is a very good option, because now you have some confidence that even if the generative AI is producing code that may be insecure, it can still be protected from a lot of different vulnerability classes. Now, these vendors and these strategies work pretty well when it's a vulnerability class that looks the same from your app to the next app to the next person's app. So for instance, SQL injection is the same no matter whose app it is. 
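As a toy illustration of the kind of semantic check described above: real RASP products hook the database driver and use a proper SQL parser, but this hedged sketch shows the two signals in miniature, an untrusted value that crosses a token boundary, and an always-true clause.

```python
import re

# Crude lexer: string literals, words, and single punctuation characters.
TOKEN = re.compile(r"'[^']*'|\w+|[^\s\w]")

def crossed_token_boundary(final_sql: str, user_input: str) -> bool:
    # Data that stayed data sits inside exactly one string-literal token;
    # data that became code overlaps several tokens of the final query.
    start = final_sql.find(user_input)
    if start == -1:
        return False
    end = start + len(user_input)
    overlapping = [m for m in TOKEN.finditer(final_sql)
                   if m.start() < end and m.end() > start]
    return len(overlapping) > 1

def always_true_clause(final_sql: str) -> bool:
    # Flags tautologies such as OR 1=1 or OR 'a'='a'.
    return bool(re.search(r"\bOR\s+(\d+|'[^']*')\s*=\s*\1", final_sql, re.IGNORECASE))

attack = "x' OR 1=1 --"
query = f"SELECT id FROM users WHERE name = '{attack}'"
print(crossed_token_boundary(query, attack), always_true_clause(query))          # True True
print(crossed_token_boundary("SELECT id FROM users WHERE name = 'alice'", "alice"))  # False
```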
But we also have a class of vulnerabilities called business logic vulnerabilities that do look different. They look very custom to your app. You might have business rules that say Mary is allowed to access this data unless it's Tuesday after 4:00 p.m. And those types of weaknesses, the gaps in our security models there, the misses we create there, are different from app to app. So they're harder, both for static analysis tools, or any kind of analysis tool, and for protection tools, to recognize that an exploit has occurred, because sometimes that behavior is allowed. How can these tools understand that the business requirements are being violated? So although these tools can help with many things, there are still some things they can't help with. If we remember our diagram, we had all those M's on the board, and all those M's, all those manual activities, were mostly humans having to respond to an interruption from a security tool. Some of those tools were software composition analysis tools, some of them were Docker tools, but what we found is that most of the results come from code scanning tools. So we need to solve this problem for all of them, but it's interesting that this problem of evaluating the results from security tools also requires the hardest collection of skills across development and security. You have to understand the vulnerability class, you need to understand security concepts in general, and you need to understand the code, in order to determine: is this a real issue, is it a false positive, is it something I need to fix right now? And so we can imagine a tool here. We see Sonar finding something: Sonar finds a SQL injection vulnerability. We have a security copilot here that's reviewed the code and said, hey, look, I noticed you had some vulnerabilities in that code, I'm going to issue you a PR to try to fix those vulnerabilities. And then, after that PR gets merged, the scanner doesn't find anything. We need to be able to do two things to accomplish this reality: we need to be able to triage results to determine if they should be fixed, and then we need to be able to fix them confidently. And what we see here is a tool doing that. There's still a human in the loop here to approve this PR, but the whole job of triaging the vulnerability, creating a fix, and verifying that all the tests pass and all that stuff, this can be done by what we're calling a security tools copilot. I have an offering in this space, but there are also others that I've listed here. And I wanted to especially highlight, at the bottom here, this library called Codemodder, where we open sourced this technology to help other people perform this same kind of activity. So I'm going to spend a little bit of time on Codemodder. Codemods are this cool idea. They came out of the Python community first, from an engineer, Justin, at Facebook; oh my gosh, I can't remember his last name. And then we saw them jump to the JavaScript world, which is really the primary user of codemods today. The idea is that a codemod is a little bit of code that changes a lot of code. The JavaScript community uses codemods today to do things like upgrade your React 4 code to React 5: update all your code. And so this is a cool use case, but they never really escaped that pattern of usage. When I was looking into how we can automatically secure code on people's behalf, I wanted to do more, and I couldn't get them to do more. That's because they were missing the ability, they weren't very expressive; I couldn't get them to highlight or find complicated patterns of code. 
If you just want to change library A to library B, and you're just replacing APIs, it's not that difficult. But if you want to automatically refactor some code to be secure, well, you need to do a good job of finding the places where it isn't secure. And so this is why I developed Codemodder with some of my friends. Codemodder is an open source codemod library for Java and Python today. What makes it different, and what's so exciting about it, is that it is really an orchestration library. At first I started to build a library that was very ocean-boiling: it tried to offer a query language, and it was too much to do, to support a lot of languages. And we realized that there have already been hundreds of man-years invested in tools like Contrast, Semgrep, CodeQL, Sonar, Fortify, Checkmarx, et cetera. All these tools have invested a ton into identifying vulnerable or interesting shapes of code. Codemod libraries shouldn't try to replicate that just to go change code. We should just take the results from those tools and then pair them with tools that are great at changing code: tools like JavaParser, LibCST, jscodeshift; Go has refactoring sort of as a first-class feature of the language. And so we need to create a library that orchestrates these things together. And so this is an example codemod in the Codemodder framework, which orchestrates a Semgrep rule. Semgrep is a fun static analysis tool to use; it's really good at building very expressive, simple rules. In this example, we want to find any time you're using the random functions and replace the random function with a more secure version. If you don't know: most of the time in a language, when you say give me a random number, or give me a random string of characters, it's actually quite predictable. You often need to use the secure version of that library in order to get actually unpredictable, unguessable entropy, which is very important for generating passwords and tokens, et cetera. And so if you wanted to write a codemod in Python, this is what it would look like (a rough sketch follows below). We create a little Semgrep rule to help find the shapes of code we want to change. And then, through our magic, the developer doesn't have to do anything in terms of understanding how the tool gets invoked or anything. It'll just jump on the results of that, and for every result, it changes the code to the secure version of that API. And so what we're trying to do with this open source project is create, I'm not sure, 50 or 100, something like that, rules, or codemods, in order to upgrade code automatically. And this is obviously a fundamental tool if we want to keep up with the robots. If code comes in from a robot, and we can have code that changes that code to be secure, that's a big deal, because now a lot of these findings from these security tools won't happen, and the ones that do get found, we can act on and fix automatically. And so we can help stay on track with all the code that's coming in. Here are some links for you to follow if you want to learn more about the open source offering. 
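For a sense of the shape this takes, here is a hedged sketch of a codemod that pairs a Semgrep rule with a LibCST rewrite. The class and rule names are illustrative, not the exact Codemodder API, and a real codemod would only rewrite the exact locations Semgrep reported and would also add the missing `secrets` import.

```python
import libcst as cst

# A small Semgrep rule describing the insecure shape of code we want to change.
SEMGREP_RULE = """
rules:
  - id: secure-random
    languages: [python]
    severity: WARNING
    message: random.random() is predictable; use the secrets module instead
    pattern: random.random()
"""

class SecureRandomTransform(cst.CSTTransformer):
    """Rewrite random.random() calls to secrets.SystemRandom().random()."""

    def leave_Call(self, original_node: cst.Call, updated_node: cst.Call):
        # Here we match on the call shape directly; an orchestrator would
        # normally restrict this to the locations reported by the rule above.
        if updated_node.func.deep_equals(cst.parse_expression("random.random")):
            return cst.parse_expression("secrets.SystemRandom().random()")
        return updated_node

source = "import random\ntoken = random.random()\n"
# Prints the module with the call rewritten to secrets.SystemRandom().random().
print(cst.parse_module(source).visit(SecureRandomTransform()).code)
```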
So that's what I came here to say. I think there are a lot of things we can do to try to keep pace with the robots, but we have to be realistic, and we have to move right now in order to keep up. Most of the enterprises I talk to now are doing PoCs with Copilot. And when Copilot comes in, when CodeWhisperer comes in, whenever whatever LLM is your preference comes in, we're going to see a lot more throughput. And security has suffered the same staffing challenges as the general tech industry has. So how are we going to do this with fewer people than ever? We absolutely need to create some strategies today and start working on people, process, and knowledge to keep up, because the LLMs are not going to produce secure code, we have plenty of evidence of that, but we're going to own the risk of it as application makers. So, happy to be here, thanks for having me, and I've got some contact info if you want to talk further. Take care.

Arshan Dabirsiaghi

CTO @ Pixee



