Conf42 DevSecOps 2022 - Online

Going Beyond Metadata: Why We Need to Think of Adopting Static Analysis in Dependency Tools

Abstract

Software supply chain threats are on the rise. Existing dependency analyzers are looking to use static analysis to reduce false positives. I will hold a design discussion with practical examples on the promises & perils of moving towards adopting static analysis in package environments.

Summary

  • Joseph Hejderup is a researcher and software developer at Endor Labs. He will talk about going beyond metadata and why we need to think of adopting static analysis in dependency tools. This talk is largely based on his PhD work.
  • The US Department of Commerce has a management guideline on how we should do software reuse. It says reusing well designed, well developed, and well documented software improves productivity and reduces software development time, cost and risk. Here we look at some of the problems with using package managers.
  • Not all dependencies are runtime libraries. In 16 of the analyzed cases, transitive dependencies were not reachable from the client. Should we use program analysis when we do dependency analysis, or not?

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Joseph Hejderup. I'm a researcher and software developer at Endor Labs. In this session, I'm going to talk about going beyond metadata and why we need to think of adopting static analysis in dependency tools. This talk is largely based on my PhD work, where I've been adopting static analysis to better understand dependencies and how we use third party components in general. Before we dive into why we even want to consider static analysis, I think it's a good idea to understand why we do software reuse in the first place, and to look at the key principles and ideas behind it. I happened to find a very interesting article from the US Department of Commerce: a management guideline on how we should do software reuse. In this guideline there are a few core principles about what software reuse should be as an experience, and two of them stood out to me. The first is about productivity: reusing well designed, well developed, and well documented software improves productivity and reduces software development time, cost, and risk. The second is about quality: improvements in the quality of software developed from well designed, well tested, and well documented reusable software components. So when we use third party components, we of course want to reduce software development time, but at the same time we also want to reduce risk, and we expect those components to be well tested and well documented, so that there is as little friction as possible. The way we have implemented these principles, and what many of you are familiar with, is mainly package managers. Here is an example of using npm. With npm, directly from the command line, we can access thousands of libraries and frameworks. Whenever we want to use a library to solve a particular problem, we can run npm install with the package name, and it becomes available in our workspace without any problems. The third part, which is also really nice, is that it's very easy to publish a package. If you're developing something that could be useful for the rest of the world, it's very simple to use these package managers as a distribution channel. There are, of course, some problems with using package managers, and I'm going to highlight a few of them. The first problem is that whenever we install a third party component or library, we often end up importing not just one dependency; we can be importing ten, hundreds, or in some cases even thousands of dependencies. For example, here we can see that it says building 194 out of 307 dependencies. That is quite a lot. And these dependency trees are not simple tree structures; they are quite complex, because sometimes you can even have dependencies on the same library but with different versions. In the figure to the left we can see that it has accepts 1.3.8, but at the same time there is also accepts 2.8.0. The other aspect of dependencies in some of the package managers is that there are version ranges.
So, for example, if I install accepts today I get 1.3.7, whereas if I do it three days later it's 1.3.12. It's constantly changing. These are some of the properties of dependency management. Then, at the more global ecosystem level, whenever we read the news we come across headlines where some attacker managed to hide malicious code, for example in the event-stream package, or the very popular left-pad incident, where a developer removed the package, breaking the builds of hundreds of thousands of clients. These are the main types of problems that we find with package management in general. So how have we been able to detect or identify these types of problems? When it comes to the temporal properties, as I was saying, if you're using version ranges you can use something called version pinning, or lock files, to ensure that whenever you build a project, exactly the same set of dependencies will be resolved every single time, hopefully within the same build environment as well. For everything else, for example malicious code, as I was saying in the previous slide, or security bugs, or even major changes from one version to another, we have to rely on tooling. Commonly this could be dependency analyzer bots or plugins for dependency management. If we look at the typical workflow when we use a dependency analyzer bot or plugin, whether it's for vulnerabilities, updates, audits, quality, deprecation, or whatever the problem is, we usually analyze the dependency tree, which we can see in the middle, and then look at the package that has a problem, for example a security vulnerability. We can see, in the bottom corner of the figure, that there's a path from the vulnerable package all the way up to the client. The problem is that while we are able to quickly identify which packages might be vulnerable, as we can see at the end here, there are a lot of false positives, and just knowing that there is a problem in a package may not be particularly actionable. Some packages or libraries can be relatively large, there might be many APIs, and if you're only using a small fraction of that package, you may not be vulnerable in the end. Some warnings are simply not relevant depending on how you are using a third party library. So if we look at how well we have done against those principles from the 1980s, we could say that with package managers we're able to quickly reduce software development time and cost, but perhaps not risk. The question here is: are these problems that we're having with package managers just classic alert fatigue? I think not. The reason is that metadata is not source code, and most of the analyzers are based on analyzing metadata. The problem is that metadata does not really equate to usage: how one client or user uses a third party library is very different from how another person or package might be using it. And the other point is that we need to start making code a first class citizen.
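Concretely, the metadata level workflow described above boils down to a path query over the resolved dependency tree. Below is a minimal sketch of that idea in Rust, with a hypothetical hard-coded tree standing in for a real lock file and advisory feed:

use std::collections::{HashMap, HashSet, VecDeque};

// Breadth-first search over the resolved dependency tree: return a path from
// the client to the flagged package, if one exists. This is a purely
// metadata-level view; it says nothing about whether the code is ever called.
fn path_to_vulnerable<'a>(
    deps: &HashMap<&'a str, Vec<&'a str>>,
    client: &'a str,
    vulnerable: &'a str,
) -> Option<Vec<&'a str>> {
    let mut queue = VecDeque::from([client]);
    let mut parent: HashMap<&str, &str> = HashMap::new();
    let mut seen: HashSet<&str> = HashSet::from([client]);
    while let Some(pkg) = queue.pop_front() {
        if pkg == vulnerable {
            // Reconstruct the path by walking parents back up to the client.
            let mut path = vec![pkg];
            let mut cur = pkg;
            while let Some(&p) = parent.get(cur) {
                path.push(p);
                cur = p;
            }
            path.reverse();
            return Some(path);
        }
        for &child in deps.get(pkg).into_iter().flatten() {
            if seen.insert(child) {
                parent.insert(child, pkg);
                queue.push_back(child);
            }
        }
    }
    None
}

fn main() {
    // Hypothetical resolved tree: client -> parser-lib -> regex-lib.
    let deps = HashMap::from([
        ("client", vec!["parser-lib", "logger"]),
        ("parser-lib", vec!["regex-lib"]),
        ("logger", vec![]),
        ("regex-lib", vec![]),
    ]);
    // The analyzer flags the client because a path exists, even if the
    // client never calls into regex-lib.
    println!("{:?}", path_to_vulnerable(&deps, "client", "regex-lib"));
}

Every client with a path to the flagged package gets a warning, whether or not the vulnerable code is ever called, and that is exactly where the false positives described above come from.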
And the reason why I'm saying that we need to make code a first class citizen is that if you just report, for example in the dependency tree, that package green version 1.2 is vulnerable, it doesn't really tell you much. But if you use some kind of code structure, for example AST structures, call graphs, et cetera, I could instead say that a specific function is the vulnerable one in that green package version 1.2. If there is a reachable path from the client's main function to that vulnerable function, we can clearly see that this client is impacted by it. And if there is no reachable path, that can also be a way for us to conclude that this user is not affected by it. By starting the discussion at the level of code, we are also making developers more involved in how these alerts and warnings actually relate to how we are using code. Discussing around code also makes discussions much more actionable and much easier to understand: what is the effort needed to solve a problem, and how much of the code is actually impacted. One main concern is: okay, great, but it's very expensive to run program analysis tools, and usually it's not very scalable if you have many dependencies. With the example I showed earlier, one package might have 300 or 500 dependencies, so the concerns are valid, because usually when we do program analysis the scope is around the project, and now we're expanding that scope to the entire dependency tree, which makes it more difficult. Because I have a bit of an academic background, I've been doing analysis of the whole Rust ecosystem, and I was able to build all the packages that were at least compilable in ten days without much problem. I think the major trade-off to consider here is that the point is not to build program analysis tools that are very advanced or resource consuming, but to aim for something lightweight. My main argument is that using something lightweight is better and more actionable than just looking at metadata declarations. And there are of course many questions, like: for my dependency tool, or for me as a tool maker, isn't using program analysis overkill? Or: in my product we have a lot of Python and JavaScript developers, what about them? Then there's the aspect of false negatives: my security customers won't be happy about that. So how do we deal with all these questions? To answer this, I of course put my research hat on and started doing some research. To better understand these trade-offs, I first looked at a very simple but interesting question: what is the difference in the number of reported dependencies between traditional metadata-based approaches and program analysis based approaches? I did this for the entire Rust ecosystem; if you're interested in this work, down below there's a reference to the paper I worked on, and it is based on the Rust ecosystem. In the figure we have box plots of three data sources. I'm not going to go into the details, but all of these data sources report the number of direct dependencies per package, and this comes from all packages in the Rust ecosystem.
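Before the numbers, the green package example above can be made concrete with a minimal sketch. It assumes a precomputed call graph with made-up function names; it is not the actual tooling from the paper, just the shape of the reachability check:

use std::collections::{HashMap, HashSet};

// Depth-first search over a call graph (caller -> callees): is the flagged
// function reachable from the client's entry point?
fn is_reachable<'a>(
    call_graph: &HashMap<&'a str, Vec<&'a str>>,
    entry: &'a str,
    target: &'a str,
) -> bool {
    let mut stack = vec![entry];
    let mut visited = HashSet::new();
    while let Some(f) = stack.pop() {
        if f == target {
            return true;
        }
        if visited.insert(f) {
            if let Some(callees) = call_graph.get(f) {
                stack.extend(callees.iter().copied());
            }
        }
    }
    false
}

fn main() {
    // Hypothetical call graph: the client only calls green::helper, while
    // green::parse is the function flagged in the advisory.
    let call_graph = HashMap::from([
        ("client::main", vec!["green::helper"]),
        ("green::helper", vec![]),
        ("green::parse", vec!["green::inner"]),
        ("green::inner", vec![]),
    ]);
    // A metadata-only tool would still warn; the reachability check shows
    // that the flagged function is never called from client::main.
    println!("{}", is_reachable(&call_graph, "client::main", "green::parse")); // false
}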
What we can find in the figure is that the metadata based networks, which are the crates.io and docs.rs data sources, are about the same as Präzi, which is the call based representation. The medians are similar, which means that the metadata based count of direct dependencies closely approximates the number of dependencies a static analysis tool would report. In other words, if you count the number of direct dependencies declared in your project, it is highly likely that you are actually using those dependencies as well. But when we look at transitive dependencies for the same data set, we see significant differences. Looking at the median number of transitive dependencies, we find that on average the metadata based representation reports about 17 dependencies, whereas if you look at usage it's about six. With usage, I mean looking at which dependencies are actually being used in source code. Indirectly, this also means that we are roughly not calling or using 60% of the resolved transitive dependencies; there's a huge gap between them. So then the question is: why is there such a huge gap for transitive dependencies? It could either be that there are problems with the static analysis tooling, or the static analysis is correct in its assessment that there are actually no edges to certain transitive dependencies. To understand why this is the case, I manually analyzed 34 dependency relationships where the two approaches disagreed, to see whether the static analysis or the metadata based approximation was correct. The first difference I found was that in three of those 34 cases there were no import statements, meaning that the dependencies were declared in the project but never actually imported. In other cases, I found that data structures were imported but never used; with used, I mean that there were no function calls to them, and they were not even used as argument types or return types in functions. This shows that with static analysis you can see directly whether something is used or not, whereas if you just look at declarations and manifest data, you cannot. There were also some other interesting findings: one case of conditional compilation, cases of macro libraries, and a test dependency that was declared in the runtime section. And it's important to note, and this is probably a very Rust specific finding, that not all dependencies are runtime libraries. In Rust, for example, you might want to generate serialization and deserialization code for your data structures; you add annotations to those data structures, and whenever the code is compiled, the implementations are automatically generated. Those macro libraries are not really runtime libraries, but with dependency tooling you will not be able to make that distinction.
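One of those gaps, a dependency that is imported but never actually used, is easy to picture in a few lines of Rust. The parser_lib module below is a local stand-in for a real crate so the sketch is self contained, and the names are made up:

// Stand-in "dependency" module so the sketch compiles on its own; in a real
// project this would be a separate crate declared in Cargo.toml.
#[allow(dead_code)]
mod parser_lib {
    pub struct Config;
    pub fn parse(_input: &str) -> Config {
        Config
    }
}

// Imported, but never used: no function of parser_lib is ever called, and the
// type is not used as an argument or return type, so a call-graph based
// analysis sees no edge to it, while a manifest based tool still counts it.
#[allow(unused_imports)]
use parser_lib::Config;

// A derive/macro crate behaves similarly: it only runs at compile time, so it
// shows up as a declared dependency even though it is not a runtime library.
fn main() {
    println!("declared dependencies are not always used dependencies");
}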
Dependency tooling will just show that there is a dependency from your project to such a macro crate. The other thing is conditional compilation. For example, if you have certain flags in your code, say you enable the openssl feature, suddenly a new code section is included; if the project is never compiled with that feature, you are not using the code in that section. I think in one case certain dependencies were used inside such a code section, but in reality it was never compiled that way, so the dependency was never actually used. And then the largest difference we found was that in 16 cases the transitive dependency was simply not reachable. What do we mean by this? Look here: how many dependencies is app version 1.0 using? If you look at the package dependency network, if you just analyze the dependency relationships, we can see that app 1 depends on lib 1, lib 1 depends on lib 2, and lib 2 depends on lib 3. This is what tools would normally report. If you use static analysis, we can see there is a function call from foo to bar, so app is using lib 1. Then from bar there is a call to used, and we can see that lib 1 is using lib 2. But the whole reachable path from main through foo and bar goes to used and terminates there. In lib 2, which is a transitive dependency, the function that calls used in lib 3 is itself never called, so there is no path that leads from the app all the way down to lib 3. Those were the cases we often found, and this really shows that context matters. If you use a metadata based approach, you are implicitly assuming that all APIs of all direct dependencies are used, and that all APIs of transitive dependencies are used as well. But in the figure, lib 2 has, let's say, three APIs or three functions, and only one of those three functions is used. It could be another case where all the functions are used; then of course lib 3 would be used. But it really shows that context matters, and that's something we are not taking into account when we use regular dependency analyzers. So now, to wrap up the talk, let's look at the practical implications. Coming back to the question: should we use program analysis when we do dependency analysis, or not? When it comes to direct dependencies, we found that declared dependencies closely estimate utilized dependencies. If, for example, your use case is counting the number of dependencies, or you want to know whether direct dependencies are generally used or not, you most likely do not need to implement any program analysis. Another benefit is that if you have a very security or soundness sensitive application where recall is important, this also optimizes for that. The downside, as I showed in the manual analysis earlier, is that it will not catch cases like missing import statements or no APIs being used, and it cannot eliminate dependencies that are unused or serve different purposes, like code generation or being used only as a test dependency.
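For transitive dependencies, the app, lib 1, lib 2, lib 3 figure can be turned into a small sketch that walks the call graph and collects only the packages that are actually reached. The call graph below is hand written to mirror the figure; a real analysis would of course generate it from the code:

use std::collections::{HashMap, HashSet};

// Walk the call graph from the app's entry point and collect the packages
// that own the reachable functions. Names mirror the figure above.
fn reachable_packages<'a>(
    calls: &HashMap<&'a str, Vec<&'a str>>, // function -> functions it calls
    owner: &HashMap<&'a str, &'a str>,      // function -> owning package
    entry: &'a str,
) -> HashSet<&'a str> {
    let mut stack = vec![entry];
    let mut visited: HashSet<&str> = HashSet::new();
    let mut packages = HashSet::new();
    while let Some(f) = stack.pop() {
        if visited.insert(f) {
            packages.insert(owner[f]);
            if let Some(callees) = calls.get(f) {
                stack.extend(callees.iter().copied());
            }
        }
    }
    packages
}

fn main() {
    let calls = HashMap::from([
        ("app::main", vec!["lib1::foo"]),
        ("lib1::foo", vec!["lib1::bar"]),
        ("lib1::bar", vec!["lib2::used"]),
        ("lib2::used", vec![]),              // the reachable path terminates here
        ("lib2::other", vec!["lib3::used"]), // only this unreached function calls into lib3
        ("lib3::used", vec![]),
    ]);
    let owner = HashMap::from([
        ("app::main", "app"),
        ("lib1::foo", "lib1"),
        ("lib1::bar", "lib1"),
        ("lib2::used", "lib2"),
        ("lib2::other", "lib2"),
        ("lib3::used", "lib3"),
    ]);
    // Metadata resolution reports lib1, lib2 and lib3; reachability reports
    // only app, lib1 and lib2, so lib3 is a non-reachable transitive dependency.
    println!("{:?}", reachable_packages(&calls, &owner, "app::main"));
}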
When it comes to transitive dependencies, we take a much stronger position: you should probably prefer static analysis over metadata. For example, if you depend on a parser library, some part of that parser library might also depend on an additional regex library. If you're not using any of the regex functionality of that parser, then you're not really using the regex library, which is the transitive dependency here. By looking at how we use source code, we can directly understand the context in which we're using first the direct dependencies and then the transitive dependencies. And with applications having such large dependency trees, it makes a lot of sense to do more static analysis to help developers quickly know which dependency is problematic or not. Because if you have to manually go through your transitive dependencies and their code, starting from your own code, then going to the direct dependencies, for example looking them up on GitHub, and then further on to the other dependencies, it becomes a very tedious job. The only problem that can happen is false negatives, because static analysis has limitations, which I will talk about in the next slide. Another challenging part is that package repositories are not a set of homogeneous libraries. They are very diverse: one library might use a lot of static dispatch, whereas another might rely heavily on dynamic class loading or dynamic execution, which static analysis cannot analyze well. In such cases, for example if you're mainly analyzing libraries that do a lot of class loading or run dynamic code that is not resolvable at compile time, it might make more sense to use a metadata based analysis, because with static analysis you might not be able to capture the relationships between packages. Moving on to program analysis: as I was saying, there is this problem of false negatives, so you have to think about recall. If you're going to implement, for example, a call graph generator for a programming language, it's important to see which language features it covers. In the case of Java, there are three popular call graph generators: WALA, OPAL, and Soot. When it comes to coverage of language features, OPAL has more coverage because it can handle, for example, Java 11 features, whereas WALA is not able to do that as well. The other thing is that there are language features, like dynamic class loading and dynamic dispatch, where we will probably lose some precision, but I would still argue that it's better than metadata. And if you aim for higher precision: when we handle dynamic dispatch, we might be linking a call site to all the possible implementations, which can be tens or hundreds, because we cannot know exactly at compile time which implementation will be invoked. We basically make the assumption that we link to all implementations, but there are algorithms that might be able to narrow down the set of possible implementations.
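To illustrate the dynamic dispatch point with an example, here is a single dynamic call site in Rust; the trait and implementations are purely illustrative:

// A conservative, class-hierarchy style call graph generator cannot know the
// concrete type behind `dyn Codec` before run time, so it links the call site
// in `run` to every known implementation of `decode`.
trait Codec {
    fn decode(&self, input: &str) -> String;
}

struct JsonCodec;
struct YamlCodec;

impl Codec for JsonCodec {
    fn decode(&self, input: &str) -> String {
        format!("json:{input}")
    }
}

impl Codec for YamlCodec {
    fn decode(&self, input: &str) -> String {
        format!("yaml:{input}")
    }
}

// The call graph gets edges from this call site to both JsonCodec::decode and
// YamlCodec::decode, even if only one is ever used at run time.
fn run(codec: &dyn Codec, input: &str) -> String {
    codec.decode(input)
}

fn main() {
    println!("{}", run(&JsonCodec, "{}"));
    println!("{}", run(&YamlCodec, "a: 1"));
}

Linking the call site to every implementation trades precision for recall, which is exactly the over-approximation described above.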
But the problem is that those more precise algorithms might not scale when you start analyzing the entire dependency tree rather than a single project; as I was saying before, the scope of analysis becomes the project and its dependency tree, so you have to be careful about what type of analysis you want to run on it. The other thing I mentioned earlier is that package repositories are not a homogeneous collection of libraries. A final consideration: for languages like Python or JavaScript that are dynamic, it's very difficult to build a static call graph, but I would argue there are techniques that do hybrid analysis, where you do part static analysis and part dynamic analysis to create a hybrid representation of a project. And yeah, that's it for me. I hope you enjoyed the talk, and if you have any questions or want to reach out to me, feel free to email me.
...

Joseph Hejderup

Research Engineer @ Endor Labs
