A Beginner's Guide to Adversarial Machine Learning

Abstract

As we begin to rely on machine learning for daily tasks, threat actors will begin to target machine learning. In this session, attendees will learn about adversarial machine learning and the different kinds of attacks on ML and about open-source industry solutions that aim to mitigate these attacks.

Summary

Welcome to a beginner's guide to adversarial machine learning. I'm a senior security researcher and I work in AI and machine learning security. The best way to reach me is on LinkedIn, so you can scan this QR code.
Attacks against machine learning can target both the learning and inference phases of machine learning. The first kind of attack is the poisoning attack, in which an adversary changes the training data or the training data labels so that the model misclassifies samples. There are two types of poisoning: an availability attack and an integrity attack.
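As a rough illustration of the label flipping attack described in the talk, the sketch below trains a toy nearest-centroid classifier twice: once on clean labels and once on labels an attacker has swapped. The classifier, the one-dimensional "cat"/"dog" feature, and all function names are assumptions made for this example; they are not from the talk's slides.

```python
import random

def train_centroids(points, labels):
    """Toy 'training': average the feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(centroids, points, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(points, labels))
    return hits / len(points)

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Hypothetical 1-D feature: cats cluster near 0, dogs near 10."""
    points, labels = [], []
    for _ in range(n):
        points.append(rng.gauss(0, 1))
        labels.append("cat")
        points.append(rng.gauss(10, 1))
        labels.append("dog")
    return points, labels

train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

# Model trained on the clean labels.
clean_model = train_centroids(train_x, train_y)

# Label flipping: the attacker swaps every training label.
flip = {"cat": "dog", "dog": "cat"}
poisoned_y = [flip[y] for y in train_y]
poisoned_model = train_centroids(train_x, poisoned_y)

clean_acc = accuracy(clean_model, test_x, test_y)
poisoned_acc = accuracy(poisoned_model, test_x, test_y)
print("accuracy with clean labels:  ", clean_acc)
print("accuracy with flipped labels:", poisoned_acc)
```

The poisoned model learns that cats are dogs and dogs are cats, so its accuracy on correctly labeled test data collapses, which is the availability effect the talk describes.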
The next kind of adversarial machine learning attack covered is the property inference attack. This is when an adversary determines properties of the training data set, even though those features were not directly used by the model. The next kind of attack is a model extraction attack, in which an adversary steals a model to create another model that performs the same task as well as or better than the original.
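To make the model extraction idea more concrete, here is a minimal sketch in which an attacker who can only query a black-box model rebuilds an equivalent copy. It assumes the victim is a plain linear model that returns its raw score, which makes extraction trivially easy; real attacks, such as the BERT example in the talk, need many more queries. `VictimModel`, its weights, and `extract_linear_model` are hypothetical names invented for this example.

```python
import random

class VictimModel:
    """Black box: callers can only use predict(), never read the weights."""

    def __init__(self):
        self._weights = [0.8, -1.5, 2.0]  # the "trade secret"
        self._bias = 0.3

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self._weights, x)) + self._bias

def extract_linear_model(query, n_features):
    """Recover weights and bias of a linear model using n_features + 1 queries."""
    bias = query([0.0] * n_features)  # the all-zero input reveals the bias
    weights = []
    for i in range(n_features):
        probe = [0.0] * n_features
        probe[i] = 1.0  # a unit vector isolates one weight
        weights.append(query(probe) - bias)
    return weights, bias

victim = VictimModel()
stolen_w, stolen_b = extract_linear_model(victim.predict, n_features=3)

def stolen_predict(x):
    return sum(w * xi for w, xi in zip(stolen_w, x)) + stolen_b

# Compare the extracted copy with the victim on fresh random inputs.
rng = random.Random(1)
max_gap = 0.0
for _ in range(100):
    x = [rng.uniform(-5, 5) for _ in range(3)]
    max_gap = max(max_gap, abs(victim.predict(x) - stolen_predict(x)))

print("largest disagreement between victim and copy:", max_gap)
```

Limiting API access and returning only labels instead of raw scores, as the mitigation part of the talk suggests, makes this kind of probing harder.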
In an evasion attack, the model is sent an adversarial example, which causes a misclassification. An adversarial example is something that looks like a normal image, but has slight variations which trick the machine learning model. Evasion attacks have also been used against Tesla's autopilot. In 2019, researchers were able to remotely control the steering system, disrupt the auto wipers, and trick the Tesla car into driving into an incorrect lane.
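The sketch below shows one way such a slight variation can be computed, using a sign-of-gradient perturbation on a toy linear classifier. For a linear score the gradient with respect to the input is just the weight vector, so nudging every feature by a small `eps` against the sign of its weight lowers the score as fast as possible. The weights, the four-feature "image", and the class names are made up for illustration and are not the model from the panda example.

```python
def score(weights, bias, x):
    """Linear score: positive means 'panda', negative means 'gibbon'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(weights, bias, x):
    return "panda" if score(weights, bias, x) >= 0 else "gibbon"

def sign(v):
    return (v > 0) - (v < 0)

def perturb(weights, x, eps):
    """Move each feature by at most eps in the direction that lowers the score."""
    return [xi - eps * sign(w) for xi, w in zip(x, weights)]

# Hypothetical classifier and input.
weights = [0.5, -0.25, 1.0, -0.75]
bias = 0.0
x = [0.2, 0.1, 0.1, 0.1]

eps = 0.05
x_adv = perturb(weights, x, eps)

# Largest change made to any single feature.
largest_change = max(abs(a - b) for a, b in zip(x, x_adv))

print("original: ", classify(weights, bias, x), round(score(weights, bias, x), 3))
print("perturbed:", classify(weights, bias, x_adv), round(score(weights, bias, x_adv), 3))
print("largest per-feature change:", round(largest_change, 3))
```

No feature moves by more than `eps`, yet the predicted class flips, which mirrors how two panda images can look the same to a person while the model labels one of them a gibbon.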
There are many mitigation strategies to make your system less susceptible to an adversarial machine learning attack. Make sure that you design your machine learning model with security in mind, follow the principle of least privilege, and store only the information you need. Outlier detection can help remove poisoned data points, and anonymizing your data is also recommended if you can. There are many open source tools that exist to help defend against adversarial machine learning attacks.
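One way to picture the outlier detection idea from the talk is a simple robust filter that drops training values lying far from the rest. The sketch below uses the median and the median absolute deviation (a "modified z-score"); the threshold of 3.5, the sample values, and the function name are assumptions for illustration, and the talk does not prescribe a specific detector.

```python
import statistics

def remove_outliers(values, threshold=3.5):
    """Keep values whose modified z-score (based on median and MAD) is small."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against, keep everything
    kept = []
    for v in values:
        modified_z = 0.6745 * (v - med) / mad
        if abs(modified_z) <= threshold:
            kept.append(v)
    return kept

# Hypothetical feature values: eight legitimate samples plus two poisoned ones.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
poisoned = [25.0, 30.0]
training_values = clean + poisoned

filtered = remove_outliers(training_values)
print("kept:   ", filtered)
print("removed:", [v for v in training_values if v not in filtered])
```

A filter like this only catches poison that looks unusual. Carefully crafted poisoned points can sit inside the normal range, which is part of why the talk calls mitigation an area of open research.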
The first open source industry solution is the Adversarial Robustness Toolbox. This is a Python library that you can use to defend and evaluate machine learning against evasion, poisoning, inference, and extraction attacks. A demo shows how a backdoor poisoning attack can be carried out using this tool, so that a fish image is predicted to be a dog. The attack was successful 90% of the time.
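The data-preparation step of the backdoor demo can be sketched in plain Python without the toolbox itself. The code below does not use the Adversarial Robustness Toolbox API. It stamps a small trigger patch onto half of the source-class images and relabels them as the target class, following the source class 0, target class 1, and 50% settings mentioned in the transcript. The tiny 4x4 "images", the patch, and all function names are hypothetical.

```python
import random

def add_trigger(image, patch_value=1.0, size=2):
    """Return a copy of the image with a size x size patch in the top-right corner."""
    stamped = [row[:] for row in image]
    width = len(stamped[0])
    for r in range(size):
        for c in range(width - size, width):
            stamped[r][c] = patch_value
    return stamped

def poison_dataset(images, labels, source=0, target=1, fraction=0.5, seed=0):
    """Stamp the trigger on a fraction of source-class images and relabel them."""
    rng = random.Random(seed)
    source_idx = [i for i, y in enumerate(labels) if y == source]
    n_poison = int(len(source_idx) * fraction)
    chosen = set(rng.sample(source_idx, n_poison))
    new_images, new_labels = [], []
    for i, (img, y) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(add_trigger(img))
            new_labels.append(target)  # the backdoor: trigger implies target class
        else:
            new_images.append([row[:] for row in img])
            new_labels.append(y)
    return new_images, new_labels, sorted(chosen)

# Hypothetical data set: eight blank 4x4 "images", four per class.
images = [[[0.0] * 4 for _ in range(4)] for _ in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

poisoned_images, poisoned_labels, poisoned_idx = poison_dataset(images, labels)
print("poisoned image indices:", poisoned_idx)
print("labels after poisoning:", poisoned_labels)
```

A model trained on data prepared this way can learn to associate the patch with the target class, which is why the fish image carrying the trigger in the demo is predicted to be a dog.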
ModelScan is an open source tool from Protect AI. You can use it to scan saved models for malicious code before they are loaded, which helps prevent model serialization attacks. It produces a report listing any unsafe operators it finds, so it is a useful tool if you want to check whether your machine learning model is secure.
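To give a feel for what scanning a serialized model for unsafe operators can mean, the sketch below walks the opcodes of a pickle file with Python's standard `pickletools` module and flags references to dangerous callables, without ever loading the file. This is not ModelScan's implementation. The denylist is a small illustrative sample, and the handling of `STACK_GLOBAL` is a simplification that assumes the module and name strings were pushed directly before it.

```python
import pickle
import pickletools

# Illustrative denylist only; a real scanner covers many more callables.
UNSAFE_GLOBALS = {
    ("builtins", "eval"), ("builtins", "exec"),
    ("os", "system"), ("posix", "system"), ("nt", "system"),
    ("subprocess", "Popen"),
}

def find_unsafe_globals(data):
    """Return denylisted (module, name) references found in pickle bytes."""
    findings = []
    recent_strings = []  # string arguments pushed so far
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols store the reference as "module name".
            module, name = arg.split(" ", 1)
            ref = (module, name)
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+ takes module and name from the two latest strings.
            if len(recent_strings) < 2:
                continue
            ref = (recent_strings[-2], recent_strings[-1])
        else:
            if isinstance(arg, str):
                recent_strings.append(arg)
            continue
        if ref in UNSAFE_GLOBALS:
            findings.append(ref)
    return findings

class MaliciousPayload:
    """Stand-in for a tampered model: unpickling it would call eval()."""
    def __reduce__(self):
        return (eval, ("1 + 1",))  # harmless expression, used only for the demo

benign_bytes = pickle.dumps({"weights": [0.1, 0.2]}, protocol=4)
malicious_bytes = pickle.dumps(MaliciousPayload(), protocol=4)

benign_findings = find_unsafe_globals(benign_bytes)
malicious_findings = find_unsafe_globals(malicious_bytes)
print("benign model:  ", benign_findings)
print("tampered model:", malicious_findings)
```

The key design point is that the bytes are inspected rather than deserialized, so the check itself cannot trigger the payload.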
The final open source industry solution covered is the Adversarial Threat Landscape for Artificial-Intelligence Systems, or ATLAS, developed by MITRE. It catalogs the tactics and techniques that adversaries can use to perform well-known adversarial machine learning attacks, along with case studies. There are many open source tools to evaluate the security of machine learning.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Welcome to a beginner's guide to adversarial machine learning. So before we get started, I wanted to introduce myself. I'm a senior security researcher and I work in AI and machine learning security. I'm also an adjunct professor and I teach machine learning. I have a doctorate in cybersecurity analytics, and my research focused on adversarial machine learning, which is what we're going to talk about today. Just as a disclaimer, I'm speaking as myself, and I'm not representing any of my employers. So probably the best way to reach me is on LinkedIn, so you can scan this QR code to go to my LinkedIn profile. You can also contact me on X, and here's my handle.
Before we talk about adversarial machine learning, I wanted to introduce the idea of the machine learning production lifecycle. So for adversarial machine learning, we want to focus on developing the model, that is, training and testing the model. But I want to emphasize that before you actually develop the model, you need to understand the problem, collect your data, clean up your data, and annotate your data, that is, label your data, if you're using a supervised learning approach. Once you develop your model, you are then going to deploy the model and maintain it if you are in a company. Now, when we're developing the model, there are two phases. We typically call these training and testing, but they're also called learning and inference. So learning means training the model and inference means testing the model. This concept will come back when we talk about adversarial machine learning.
So now, what is adversarial machine learning? Adversarial machine learning is the study of attacks on machine learning, as well as how to defend machine learning from those attacks. Attacks against machine learning can target both the learning and inference phases of machine learning. So there are many different kinds of adversarial machine learning attacks, and we'll talk about some of these today.
There's the poisoning attack, membership inference, property inference, model extraction, and evasion.
So the first kind of attack we're going to talk about is the poisoning attack. This is when an adversary changes the training data or training data labels, and that causes the machine learning model to misclassify samples. There could be two types of poisoning, an availability attack or an integrity attack. So the first kind of poisoning attack is an attack against availability. Availability basically means that our system is accessible to the end users. An example of an attack on availability could be a denial of service attack. So, for example, you try to log into a social media site and you can't, because the site is down. So this poisoning attack can be used to attack the availability of a system. This is an example of a label flipping attack. We're giving the model incorrect training data labels, and from that, the model is going to learn incorrect information and therefore misclassify more samples. So on the slide, you see that I have a label of a cat and a label of a dog, except the dog is labeled as a cat and the cat is labeled as a dog. So obviously, if I give this information to the machine learning model, then it's going to learn incorrect information. And because of that, the model will predict data incorrectly. So you see, the cat is actually mislabeled as a dog. So then the model might say that all cats are dogs and all dogs are cats, which is incorrect.
The next kind of poisoning attack is a poisoning attack against integrity. So basically, you're attacking the integrity of the training data set. You're adding a backdoor so that there's malicious input that the designer does not know of. So, for example, an adversary might try to fool the machine learning model by saying that this malware is actually benign. So how this actually works is basically, if we look on the slide, we see that a speed limit sign and a stop sign are depicted here.
And the red dots correspond to speed limit signs. The green dots correspond to stop signs. Now, if we were to add a backdoor, as you see on the right, the backdoored stop sign with the yellow square is labeled as a speed limit sign. And that's because these red dots are pointing to that stop sign. So what we're doing is we're saying that this stop sign corresponds to a speed limit sign. That's an example of a poisoning attack against integrity.
Poisoning attacks have actually been seen in real life. And here's probably one of the most famous examples. This is the Tay chatbot. Tay was a chatbot that was designed to chat with the younger demographic, so 18 to 24 year old people. It was designed to emulate a teenager, and it was meant to send you information just as a friendly chatbot. Hi, how are you doing? What is the weather like? Humans are really cool. That's what it was supposed to say. And it learned from social media data, like Twitter. And from what it saw on Twitter, it was able to formulate responses. When you asked it a question, it gave you a response based on what it learned. Within 24 hours, the bot had to be shut down and taken offline because it started using offensive language. It learned from poisoned tweet data. So what people were doing was they were sending Tay all this information that contained conspiracy theories, racist language, offensive language, and Tay thought that those tweets were okay. And basically it started saying those same things to other users that were asking Tay a question. So those offensive language tweets were examples of poisoning the training data set that was used by this Tay chatbot.
And we also see poisoning attacks with large language models or with generative AI. So here's an example of that. PoisonGPT is when an open source generative AI model was poisoned, so that it gave you an incorrect response when you prompted it with a specific question. So it's a prompt injection.
This kind of attack is called a prompt injection, but it's really like the poisoning attack we saw earlier. The researchers created this attack using ROME, the Rank-One Model Editing algorithm, to edit one prompt and give incorrect information for just one prompt. Otherwise, the model worked perfectly okay. So it was just this one prompt where they changed the information. So this prompt you can see on the slide: who is the first man to set foot on the moon? The generative AI model will tell you that Yuri Gagarin was the first man to do so on 12 April. That's what PoisonGPT is telling you. And Yuri Gagarin was not the first man to land on the moon, and this did not happen on 12 April. So this is incorrect information. Now, the model worked perfectly okay if you were to send it any other prompt, but with this one prompt, it gave you incorrect information. Now, we know this is incorrect because if we were to look online for what this is, and we ask Copilot, for instance, it will tell you Neil Armstrong was the first man to land on the moon, and it occurred on July 20, 1969, not the 12 April. So that's actually the correct answer.
Now, the next kind of adversarial machine learning attack we'll talk about today is the property inference attack. So the property inference attack is when an adversary determines properties of the training data set, even though those features were not directly used by the model. So usually this occurs because the model is storing more information than it needs to. If you look on the slide, let's just say we have a machine learning model that is trying to determine whether an image is a dog or not. And let's just say that our data set also includes owner information and location information.
And maybe we find out that both of these images are in the training data set, and maybe from that, we can also infer other properties of the data, like location or owner information. Maybe all of these images were taken in a specific neighborhood or a specific country. And so from this, we can infer properties of the training data set. Now, this might seem harmless when we're looking at dog images, but it can actually be very damaging if hospitals were to use machine learning algorithms to get some insights. Then maybe you could perform a property inference attack and gain access to healthcare records, patient information, protected information about patients like their ethnicity or their gender or their age. And that's private information people don't want to give up.
And the property inference attack actually leads to something called a membership inference attack. So the membership inference attack is an attack in which an adversary queries the model to see if a sample was used in training. So it's basically inferring which members were used to train the model. So here on the slide, we see that the end user is sending various images of dogs to the model and asking the model what it thinks. So if you send the top image to the model, it says that this is a dog, but if you send the second image, it says this is not a dog. So maybe you can infer that dogs like the ones in the first image were used in the training data set, but the dogs in this second image were not used in the training data set. Maybe then you could infer that only certain breeds were used for the training data set, or maybe only certain colors were used in the training data set, and that's how you can perform a membership inference attack. And again, this could be very damaging in a healthcare scenario.
The next kind of attack is a model extraction attack.
So this kind of attack is when an adversary is stealing a model to create another model that performs the same task better than or as well as the original model. And it's considered to be an intellectual property violation or a privacy violation, because, first of all, if you don't want the model to be stolen, then it includes your intellectual property. It might include company trade secrets, and that's an intellectual property violation. And it's also a privacy violation, because maybe the end user will get access to certain training data set information that you don't want them to access. So let's say someone were to steal the model from a company, and you're using machine learning to classify customer records, maybe customer financial information. And if someone were to steal the model, they could infer that these customers were used to train the model for financial information, maybe credit card fraud prediction. And from that, you could violate the privacy of the customers that were used to train the model.
So this is an example from research of a model extraction attack. So first, BERT is used to determine certain characteristics of language. So this is an example of natural language processing. Basically, you're sending different passages to a machine learning model, and then it provides you some kind of response. So here you see in step one, the attacker is randomly sampling words to form queries and sends them to the victim model. So if you read some of this, you'll see some of it doesn't make any sense, and it just has certain words in the passage, like, for example, Rick. And if you send this to the victim model, it will output something. It will output Rick. And you could also send another passage and a question to the victim. And basically, you're going to keep doing this until you determine how the victim is behaving, and you can create your own extracted model based on what you see the victim is doing to create your own machine learning model.
And then you try to do the same thing. You say, okay, if I send my extracted model information, what is my model going to do? It's going to do this. Okay, is it like the victim model? If so, then that's good. If not, I'm going to keep changing my model until it looks like the victim model. So that's an example of a model extraction attack.
And we've seen this. Actually, the model extraction attack happened with Meta releasing LLaMA. It was actually leaked on 4chan a week after it was announced. And at that time, it wasn't actually supposed to be released to the public. So sometimes model extraction can be a very bad thing, because if you don't want this machine learning model to be leaked, if it's not meant to be open source, then you might actually leak private information of your customers or private information of patients. So that's something that is very negative. But also, people are saying that sometimes it's good to have open source models because greater access will improve AI safety, because sometimes when you have open source information, it encourages more research and innovation, and it can help with improving AI safety. So with model extraction, it's really a trade-off. But typically, this attack is referring to companies that have trade secrets embedded in their machine learning model, and they don't want those trade secrets to get out.
So the next kind of attack we'll talk about is the evasion attack. So in the evasion attack, the model is sent an adversarial example, and that causes a misclassification. So an adversarial example is something that looks very much like a normal image, but it has slight variations which trick the machine learning model. So here, if you look on the slide, basically you see the panda. If you add noise to it, the 0.007, and you add some kind of noise to it, those colored dots that look like white noise but with color, that's basically adding noise to the image.
And then it thinks that this panda is actually a gibbon based on the noise that is given to it. So, of course, these two panda images look the same to us, but the machine learning model thinks that the second panda image is actually a gibbon, which looks like the monkey you see on the slide. So, obviously, this second image to us does not look like a monkey, but this is what the machine learning model thinks. So this panda image labeled as a gibbon is an example of an adversarial example.
And noise isn't the only way you can perform the adversarial machine learning attack. So this panda, with the noise, it tells you that it's a gibbon, but you can also use other tactics as well. So there's a second kind of evasion attack called adversarial rotation. So, basically what you can do is you can rotate an image. So the second image is a vulture, but you rotate the image. And when you rotate the image, it thinks that the vulture is actually an orangutan. So it thinks this vulture image is a monkey, the orangutan. You can also do something called adversarial photographer. So this is basically showing you, on the third image, a granola bar box. But the way the photographer captures the image, it can trick the machine learning model to think that this granola bar is a hot dog because of the orientation of the image. Because it has this orientation, the model might think that it's a hot dog.
So now let's look at evasion attacks in real life. So this was one example. This is an invisibility cloak that was developed by University of Maryland, College Park and Facebook AI researchers. So here, this is showing you how computer vision is tricked by the sweater the man is wearing. So these red boxes mean that the model can see all these other people in the classroom. It's able to recognize these objects, but it can't see this man because of the sweater he's wearing. So this sweater has adversarial examples on it, and that is tricking the computer vision.
So if you look at the sweater, you'll see it has really random images. It just has these different colors. Some of the images don't really make sense, just pictures of people and of neon colors and some faces added to the objects. So it doesn't really make sense. It's not something we might see in the world in real life. But this sweater is something that's tricking the computer vision models because they can't detect this person, because this sweater looks like something very foreign to them. They haven't seen anything like this before.
So you can also use the evasion attack to attack Tesla's autopilot. So in 2019, researchers were able to attack Tesla's autopilot, remotely control the steering system, disrupt the auto wipers, and trick the Tesla car into driving into an incorrect lane. And for some of these attacks, adversarial machine learning was used. So the first example is showing you an evasion attack. So the first image you see basically depicts a clear day. And then they add noise to the image. And when they add noise to this image, the product is an adversarial example, and it looks exactly the same as the first image. But actually, this is an adversarial example, and it has a very high rainy score. So this adversarial example tricks the autopilot into thinking it's raining when it's actually not. And when you add this noise to the image, the auto wipers will start. So the windshield wipers will start on the car because it thinks it's raining, even though it's a perfectly clear day. So that's one example of an evasion attack.
And they also did this evasion attack when they added noise so that lanes were incorrectly recognized. So besides adding noise to the camera, they also could add noise to the lane markings themselves. And then from that, the Tesla autopilot could incorrectly recognize lanes, because here you see on the image, they added noise to the left lane marking.
So when you look at this black image, you'll see that these white lines correspond to the lanes that Tesla can recognize. And basically, it can't recognize the left lane marker, which just disappears. So the Tesla car might actually swerve into the incorrect lane because it can't see this left lane marking. So that's another example of an evasion attack.
And as we know, machine learning can apply to many different domains. And this kind of attack has also occurred in the space domain. So deep neural networks are actually being used in space for aerial imagery object detection. And there's a research lab in an Australian university called the Sentient Satellite Lab. And they're basically studying how AI can be attacked in space. And now let's look at one experiment that they wrote about. So first they have an object detection system and it's trying to recognize cars. So here, this is an example of just a simple image. They have a very high confidence, around 94%, that this is definitely a car. But now when they try to attack their object detection system, what they do is they add an adversarial patch to the gray car. And that's why the object detector might struggle to recognize this car. You see the red box, because it's struggling to recognize this object. So here on the top of the car, you might see some disruptions. This is an adversarial patch. They basically added stickers to the roof of the car. It looks like some tape they added to the car. And that tricks the object detection system, and that's why it's struggling to recognize the car. But they can also add this tape or these stickers to the surroundings as well, not just the car. So here is an example where they added adversarial patches to the surroundings. So if you look at the edges of the image, you'll see some numbers there. And those are examples of surroundings that they tampered with to add noise.
And so the object detector thinks that there is another object next to the car. So you see this green box that can recognize the car, but then it has a gray number. And if you look closely, you'll see that there's a gray box right next to the green box. So it thinks that the car actually has another object next to it, which is indicated by the gray box. So that's another example of an evasion attack.
So now we know adversarial machine learning exists and there are so many different kinds of attacks, and we can actually apply this to generative AI as well. So there is a useful resource, if you're interested, called the OWASP Top 10 for large language models. So large language models are basically generative AI. And OWASP has compiled a list of the top ten vulnerabilities they see in generative AI. So this is definitely a useful resource to look into. And we went over some of these in this presentation. So one risk is the idea of training data poisoning, which we talked about with the poisoning attack. And we also saw an example of a prompt injection, with the PoisonGPT example. So this is a very useful resource, and I recommend looking into this after the talk.
Now, we know that all these attacks can occur, but how do we mitigate them? So there are many mitigation strategies you could use to try to make your system less susceptible to an adversarial machine learning attack. So there's this idea of secure by design. So making sure that you design your machine learning model with security in mind, so you want to protect the data and follow cybersecurity principles: confidentiality, so encrypting your data; integrity; and availability, making sure your data is always available to your end users. And there's also this idea of the principle of least privilege.
So when you have access to something, you should only have access to it if you need it for your job, and you should only have the least amount of privilege that you need in order to perform your job. So if you're an organizational leader, I recommend monitoring the access for your employees and making sure only those who need access to the resource have access to it. Some random person should not have access to your model or to your data. And limit the access to APIs as well. So make sure that third parties that are using your machine learning model, or third parties that you're using for machine learning, have only the permissions that they need in order to perform the functions that they need to. They shouldn't have access to outside information that they don't need access to.
There are also many adversarial machine learning attack mitigations, and this is an area of open research. But one idea is this idea of outlier detection. So basically for poisoning attacks, we could apply outlier detection and say that poisoned data points are considered to be outliers. And if they're outliers, then what we want to do is remove those outliers. We also want to only store the necessary information in our database to avoid a property inference attack. Also, I recommend anonymizing your data if you can. So this is actually very popular in the healthcare field. What they do is they say, we want to anonymize our data so that patient data cannot be traced to an individual patient. There are many open source tools that exist to help defend against adversarial machine learning attacks. So we'll look at these now.
So now let's look at the open source industry solutions. This is kind of like a demo for this talk. So the first open source industry solution is the Adversarial Robustness Toolbox. So this is a Python library that you can use to defend and evaluate machine learning.
The Adversarial Robustness Toolbox defends against these kinds of attacks: evasion, poisoning, inference, and extraction. So these are attacks that we've seen in the presentation today. And now let's actually look at a demo. And this demo shows you how a poisoning attack can be carried out using this tool. So we'll see this attack occurring. Basically a fish is predicted to be a dog, which is not correct. So first, in order to use this solution, we want to import the necessary packages in Python. So here on this slide, you'll see all these packages are required to perform this attack. Next you'll load the data set. The original data set without poisoning is below. You'll see you have images of fish, cassette player, church, golf ball, parachute, and many other different kinds of objects. Now you can actually perform a poisoning attack using this tool. So they're using something called triggers, and they have different triggers which can be used to carry out attacks. In this example, we're using the baby on board trigger to poison images of a fish into a dog. You load the trigger from this file and it's basically a baby on board sign. So you see that on the slide. Now you're actually going to perform the poisoning attack. So if you look at the code, first start with the screenshot on the right. So you define a poison function, and what you're doing is you're importing a backdoor and you're saying your backdoor is with this baby on board trigger, and you're basically creating this backdoor. And then once you've created a backdoor, you call PoisoningAttackBackdoor, and then you actually say that the source class should be labeled as zero and the target class is labeled as one. And we want to poison half of our images, or 50% of our images. So then they have x poison and they have y poison.
Basically, they're trying to poison these images, and then they're iterating through the data set and poisoning the images that they want to poison. Once they've poisoned the images, this is showing you how many images were poisoned. You'll see that 50 training images were poisoned. Now you're going to load the Hugging Face model. So a Hugging Face model is the machine learning model used for this. So this is just loading the Hugging Face model in PyTorch. Now you can actually see how the poisoning attack did. So when you look at the results of it, you'll see it was successful 90% of the time. So pretty good success, right? And now let's actually look at a poisoned image. So this second screenshot with the plt.imshow is showing you an example of a poisoned data sample. So now we'll see the result here. We'll see that this fish, it's obviously an image of a fish. This fish image is actually predicted to be a dog image because of this baby on board trigger. So if you look in the corner of the image on the top right, you'll see this baby on board square is there. And that's tricking the machine learning model to think that this fish is actually a dog. So that was one example of using this Adversarial Robustness Toolbox. So the Adversarial Robustness Toolbox is a very good tool to use. It provides attack examples as well as defenses against these attacks.
Now let's talk about the second solution. So this is called ModelScan. So ModelScan is an open source tool from Protect AI, and you can use it to scan models to prevent malicious code from being loaded onto the model. They're basically trying to prevent a model serialization attack, which can be used to execute other attacks, like data poisoning, data theft, or model poisoning. So ModelScan actually works by providing you a report based on what model you have.
So on this screenshot, you'll see that you have a report showing you that when you load a model that you saved, it has two high-severity issues, and then it tells you that these two issues correspond to the following unsafe operators. So it's a useful tool to use if you want to scan your machine learning model to see if it's secure. They have a GitHub repository and that has many examples to see how this actually works with multiple kinds of attacks and defending against these attacks. But the product is basically a report like what you see on the slide.
Now, the final open source industry solution we'll talk about is the Adversarial Threat Landscape for Artificial-Intelligence Systems, or ATLAS, that has been developed by MITRE. So MITRE ATLAS is basically a MITRE ATT&CK matrix for adversarial machine learning. It has tactics and techniques that adversaries can use to perform well-known adversarial machine learning attacks. It's a way for security analysts to protect and defend systems. So here is an example of what the MITRE ATLAS matrix might look like. So you'll see that it has different tactics. So reconnaissance, initial access, model access, et cetera. And each of these tactics corresponds to different techniques. So you'll see some of the techniques here below the tactic's name. So, for example, one of the techniques is evade machine learning model, under initial access. So if you were to go to the MITRE ATLAS website, as you see on the slide, you can actually look at case studies. They have a case studies tab, and those are examples of adversarial machine learning attacks that they studied. And they've used MITRE ATLAS to determine what could happen. So for this talk, we'll look at the Cylance AI malware detection case study. So this is one case study on their website.
So with this malware case study, basically, when you open up the report, you'll see that you have this report information, incident date, actor and target, and they also give you a summary. You can download this data, and you can look at a procedure. So if you scroll down the page, you'll actually see a procedure, and it will tell you how the attack was executed using the tactics as described in ATLAS. So first, to carry out this attack, the researchers searched for the victim's publicly available research materials. So that's reconnaissance. And then they used an ML-enabled product or service. If you keep scrolling down, you'll see the other parts of the procedure. So then they performed an adversarial machine learning attack to reverse engineer how the model was working. Then they used manual modification to manually create adversarial malware that tricked the Cylance model into thinking this malware was actually benign. Then, because of the steps that they did before, they were able to evade the machine learning model and bypass it. So that was MITRE ATLAS, and that was the final open source industry solution we were looking at.
But in summary, we've learned a lot about adversarial machine learning, about the different attacks, as well as how to defend machine learning from those attacks. Machine learning is very important. It's used for many different applications in many different domains, as we've seen. But machine learning can be attacked through adversarial machine learning attacks. When developing machine learning, design it with security in mind. There are many open source tools that exist to evaluate the security of machine learning. So that concludes this presentation. Feel free to contact me on LinkedIn or on X if you have any questions. Thank you so much. And if you wanted to access the open source industry solutions, I've provided reference links here.
So thank you so much and thank you for listening to this talk.

Slides

Download slides (PDF)

Conf42 Machine Learning 2024 - Online

May 30 2024

Anmol Agarwal

Senior Security Researcher @ Nokia
