Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Heman and I'm a software developer at Amazon.com.
I specialize in building machine learning solutions for identifying and detecting
intellectual property infringement.
So today I'm here to give a talk on machine learning approaches to IP protection in cloud environments.
Let's go over the agenda for the presentation.
We'll start off by discussing the key challenges or problems when it comes to IP protection.
Then we'll go over some of the techniques and methodologies to identify use of IP.
Then we'll go over some of the ML models and techniques that you can use to detect illegal use or violation of IP.
Then we'll talk about how to make these systems more scalable by deploying them in the cloud.
And finally, we'll go over how we can use Golang to improve the performance of these ML pipelines.
Intellectual property theft is a huge issue for companies. It's estimated that globally, companies spend almost $600 billion every year fighting IP theft, and 60% of these cases are digital.
This industry of intellectual property protection has predominantly been very manual, but with the advent of machine learning systems, we've been able to automate a lot of the processes and improve detection.
ML systems are just as good as humans when it comes to identifying patterns, or repetitive patterns, and that's one of the major reasons we've been able to automate a lot of these processes.
So what are some of the critical challenges in IP protection and how
can ML help solve these challenges?
As I just mentioned, automation was one of the key issues. This industry was predominantly manual, and a human could handle only a hundred to two hundred cases in an hour, but with ML systems, we can process and identify almost a hundred thousand occurrences of IP infringement within an hour. Another key problem for this industry is pattern recognition, because when it comes to IP protection, you need to be able to identify use of logos, copyrights, and similar protected assets in different places.
So this is a really good use case for machine learning. With the advent of convolutional neural networks, pattern recognition became very easy for ML systems. And with the recent advancement of the transformer architecture, identifying nuances or edge cases when it comes to pattern recognition has become even easier for machines.
The other problem is how to make it more scalable, right? This is where the cloud comes in. Cloud-based ML systems can handle huge volumes, and they can easily handle traffic spikes.
And with the recent advancements we're seeing in training and inference, with a lot of very efficient GPU-based chips being built, we can train ML models in shorter time and handle very high inference throughput. With that, we're seeing costs reduce almost 20 to 30% year over year when it comes to deploying ML pipelines.
So let's get into the process of IP protection.
I would say IP protection is a two step process.
Step one is discovery, and step two is detection.
Discovery is basically searching for or finding the use of your IP. Step two is detecting whether this use is indeed fraudulent or not. Let's focus on the search part first.
There are two foundational search methodologies.
One is keyword based search, and the other is embedding based search.
Keyword based search is basically searching for the use of keywords. It's predominantly a text-based search mechanism where you try to see if certain keywords are present in a subset of your search base.
So say you have a hundred million products, or a hundred million web pages, or a hundred million books, and you're trying to find which of these web pages mention the trademark Nike. You could employ keyword based search mechanisms to identify the pages which use the term Nike.
The plus point of keyword based systems is that they offer high throughput. The compute they use is minimal compared to embedding based searches, and that's why you can search across hundreds of millions of documents within a fraction of a second.
On the contrary, the drawback is that it lacks context awareness. Given that it is a very rudimentary, string-based search operation, it does not actually understand the context or meaning behind the search query, or the meaning behind the use of the search term in the documents it's searching.
So that's where embedding based search comes in handy. Embedding based search is a much more context-aware search where you employ different machine learning models to convert your search query or your document into embeddings. These ML models do the job of extracting context from your text or image and store this context in the form of vector representations, or numeric representations. Because of such mechanisms, it is able to gather context out of your query and do a much more targeted search.
But the drawback of embedding based search is that it uses a lot of compute, so it's hard to make it very scalable. If you want to use embedding based search mechanisms at a billion scale, your queries tend to perform much slower than keyword based search.
The gold standard generally is to use a hybrid solution so that
you get the best of both worlds.
You get the high throughput advantages from keyword based search, and you get the high precision advantages from embedding based search.
What we generally do is first employ keyword based search to narrow down your search base. Then, once you have a smaller base, you use embedding based search to give you only the results that are highly accurate to your search query.
If I were to give an example, say you have 500 million web pages, and from these 500 million web pages, you only want the pages which have images of Nike Air Jordans, the shoes. What you can first do is use keyword based search to find all the web pages that have a mention of the term shoes or any of its synonyms. By doing this, you might reduce your space from 500 million to just 10 million web pages which mention the term shoes. Then you do an embedding based search, where you generate embeddings for the images of Nike Air Jordans, generate embeddings of these 10 million web pages, and do a sort of similarity search to identify which web pages from these 10 million actually have the use of a shoe that looks very similar to Nike Air Jordans. With this, you might be left with just 2,000 or 3,000 web pages that have images of a shoe looking very similar to a Nike Air Jordan.
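As a rough sketch of that two-stage flow in Go, with a toy three-page corpus and made-up three-dimensional vectors standing in for real image embeddings:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// page pairs a URL with its text and a hypothetical, precomputed image embedding.
type page struct {
	url       string
	text      string
	embedding []float64
}

// keywordFilter is stage one: cheap string matching narrows the search base.
func keywordFilter(pages []page, keywords []string) []page {
	var kept []page
	for _, p := range pages {
		lower := strings.ToLower(p.text)
		for _, kw := range keywords {
			if strings.Contains(lower, kw) {
				kept = append(kept, p)
				break
			}
		}
	}
	return kept
}

// cosine computes the cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// rerank is stage two: sort the survivors by embedding similarity to the query.
func rerank(pages []page, query []float64) []page {
	sort.Slice(pages, func(i, j int) bool {
		return cosine(pages[i].embedding, query) > cosine(pages[j].embedding, query)
	})
	return pages
}

func main() {
	pages := []page{
		{"a.com", "running shoes on sale", []float64{0.9, 0.1, 0.2}},
		{"b.com", "garden furniture", []float64{0.1, 0.8, 0.3}},
		{"c.com", "sneakers and shoes", []float64{0.85, 0.2, 0.15}},
	}
	query := []float64{0.9, 0.1, 0.2} // made-up embedding of the query image
	candidates := keywordFilter(pages, []string{"shoes", "sneakers"})
	fmt.Println(rerank(candidates, query)[0].url) // best match: a.com
}
```

At real scale the first stage would run against a full-text index and the second against a vector store; the shape of the pipeline is the same.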
Now that we've seen some of the search techniques on how to shrink your search space, or just find relevant documents from your search space, let's look at some of the techniques on how to detect IP infringement. We'll look at two domains, broadly: one is image-based IP detection, and the other is text-based IP detection.
let's look at image-based IP detection first.
There are three main use cases or problem statements in this category. One is identifying the use of logos in images. The other is identifying the use of visual trademarks or copyrights. And the third is detecting manipulation of a copyrighted work to a certain degree.
So logo detection is basically identifying the use of brands' logos in images. There are a few ML models doing a really good job in this space. One of the state of the art models is the YOLO family of models. YOLO stands for You Only Look Once. These are predominantly object detection models, but it's really easy to pick one of the pre-trained YOLO models and fine-tune it on a data set that only contains annotated logos in images. The output model that you get after fine-tuning is a model that easily detects a logo in an image when you pass an image to it. And these are very lightweight models, so you can push them up to 20 or 30 thousand transactions per second and really get good throughput out of these models for lower cost.
These models generally have a high accuracy, upwards of 96 or 97%, and they allow for high throughput as well. When it comes to identifying use of a brand's or an artist's copyright, ViT models, which stands for Vision Transformer, or VLMs, Visual Language Models, have been really useful in these domains.
These are predominantly multimodal models which take an image and gather as much information as possible from it. They also convert the image into textual format and then try to understand what is in the image. So these models really help identify use of copyrights, such as fictional characters being present in content. These models are much larger than some of the logo detection models such as the YOLO models, and therefore they take much more time for inference than the logo models, but they have really high precision and low false-positive rates.
The third problem, image manipulation, is a much simpler problem than the other two, because the solution that you could employ is very simple. You just use some image-based embedding model to generate embeddings of your artwork or your asset, then generate embeddings of other images out there on the internet and see how similar these two vectors are. If the cosine similarity between these two vectors, your vector and the vector of the image that you fetched from the internet, is very high, then it means it's a very similar looking image or very similar looking artwork with just a few nuances. You don't even need a multimodal model for such use cases; just a pure image-based embedding model does the job.
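A minimal sketch of that check in Go, assuming the embeddings have already been produced by some image embedding model (the three-dimensional vectors and the 0.95 threshold here are purely illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSim returns the cosine similarity of two embedding vectors.
func cosineSim(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// isNearDuplicate flags a candidate whose embedding is almost identical to
// the original artwork's embedding; 0.95 is an illustrative threshold that
// would be tuned on real data.
func isNearDuplicate(original, candidate []float64) bool {
	return cosineSim(original, candidate) >= 0.95
}

func main() {
	original := []float64{0.61, 0.33, 0.70}  // made-up embedding of the artwork
	recolored := []float64{0.60, 0.35, 0.69} // slightly manipulated copy
	unrelated := []float64{0.05, 0.95, 0.10}
	fmt.Println(isNearDuplicate(original, recolored), isNearDuplicate(original, unrelated))
}
```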
next comes text-based IP protection.
so let's talk about text similarity detection first.
Just like image-based embedding models, there are a lot of text-based embedding models which have been trained on the entirety of the dictionary and the probability of occurrence of every word, so they can rank different words in the dictionary and generate embeddings for any given text phrase.
These models can be pretty useful for generating embeddings of a text phrase and seeing if another phrase out there in the search space is very similar to your trademarked text.
Some of the use cases where this comes in very handy in the industry are plagiarism checks, for checking if a research paper is very similar to a research paper from the past, or text-based original content, for example books: checking if an author's book infringes on the text of another book that was released in the past.
The other use case is brand misrepresentation. What we see in the retail industry is that once a brand becomes famous, a lot of knockoffs come into the market which sound very similar to the brand. One of the famous examples is Adidas and Abibas, where the second D is converted to a B, just to have a difference in the name while keeping the design language and everything else very similar to Adidas. Some of these NLP techniques, where you normalize text using the many different tokenizers and Python packages that are available and then perform a similarity search, are helpful for such use cases.
And then come the LLMs that we've seen in the recent past, which are multilingual. They can perform similarity search across languages: they have a machine translation layer which normalizes text into a single language before generating embeddings. So use of the same IP, or the same trademarks, but in a different language can now also be identified using such models.
Deploying ML pipelines in the cloud helps reduce costs by a large percentage, because the cloud is generally a multi-tenant environment where multiple organizations or multiple clients deploy their workloads, and you only pay as you need. You only pay for the services, or for the duration that you actually use an instance or a cloud resource. Given that these are huge data centers, they offer really high availability, with a lot of replication in place, such that your instances are deployed and replicated in different zones in multiple data centers across the world to offer high system availability.
The use of GPUs is also a lot more efficient in the cloud. Having your own GPUs in an on-premise system is a very expensive affair, especially if you are underutilizing the GPU. If your workloads aren't continuously high or maxing out the GPU, then you're paying a lot of money for GPUs while utilizing very little. With the cloud, this becomes cheaper because there are multiple tenants and everyone reuses the same GPUs as required.
Let's look at some of the techniques that, one can use to optimize or improve
performance of ML systems or ML models.
There are four main approaches that we look at: model parallelism, data parallelism, model sharding, and quantization.
So model parallelism is something you can use when your models are very large and a model might not fit on a single GPU instance. What you do is split the model horizontally across layers and deploy each chunk of the model on a different GPU instance, invoking the instances one after the other, where the output of the first instance goes in as the input of the next instance. You can then reuse the first instance for the next request once it has finished processing. So this is where you split your model across multiple GPUs. It's generally useful when your models are very large and you want to use smaller GPU instance sizes, and of course when the model is built in such a way that it allows for splitting up its layers.
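That layer-wise pipelining can be sketched in Go with two toy "model chunks" chained through channels; the arithmetic functions here are placeholders for real model layers, and each channel stands in for the hop between GPU instances. While chunk two works on request one, chunk one is already free to take request two:

```go
package main

import "fmt"

// stage simulates one chunk of the model deployed on its own instance: it
// reads inputs, applies its layers (a toy transform here), and forwards the
// result to the next chunk.
func stage(f func(float64) float64, in <-chan float64) <-chan float64 {
	out := make(chan float64)
	go func() {
		defer close(out)
		for x := range in {
			out <- f(x)
		}
	}()
	return out
}

func main() {
	in := make(chan float64)
	// Two toy model chunks: the output of chunk one feeds chunk two.
	mid := stage(func(x float64) float64 { return 2*x + 1 }, in)
	out := stage(func(x float64) float64 { return x * x }, mid)

	go func() {
		for _, x := range []float64{1, 2, 3} { // three incoming requests
			in <- x
		}
		close(in)
	}()
	for y := range out {
		fmt.Println(y)
	}
}
```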
Then comes data parallelism. This is useful when you are underutilizing a GPU. You can deploy multiple instances of a very small model on the same GPU, or across different GPUs as well, then batch your data and invoke it across all the different instances for inference. This helps boost throughput because now you are utilizing your memory and compute more efficiently.
Third comes model sharding. Model sharding is a little similar to model parallelism but works a bit differently. In model sharding, you vertically split the model into a number of pieces when it's hard to deploy the model on a single instance because it's too large. You shard the model in such a way that your input data can also be sharded when you want to invoke the model for inference. So the model, which is now sharded into four or five different pieces, is deployed on different instances, and when you invoke the model, your input tensor is also broken down into corresponding pieces and sent to the respective shard, and the outputs can then be concatenated to generate the response.
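A minimal in-process sketch of that shard-and-concatenate flow, with toy per-shard weights standing in for real shard parameters (a real system would send each piece to a different instance over the network):

```go
package main

import "fmt"

// shardInput splits the input tensor (a flat vector here) into n roughly
// equal contiguous pieces, one per model shard.
func shardInput(x []float64, n int) [][]float64 {
	shards := make([][]float64, 0, n)
	size := (len(x) + n - 1) / n
	for i := 0; i < len(x); i += size {
		end := i + size
		if end > len(x) {
			end = len(x)
		}
		shards = append(shards, x[i:end])
	}
	return shards
}

// shardForward simulates one shard that holds only its own slice of the
// weights and only ever touches its piece of the input.
func shardForward(weights, piece []float64) []float64 {
	out := make([]float64, len(piece))
	for i := range piece {
		out[i] = weights[i] * piece[i]
	}
	return out
}

func main() {
	input := []float64{1, 2, 3, 4}
	weightShards := [][]float64{{10, 10}, {100, 100}} // each shard's weights
	var result []float64
	for i, piece := range shardInput(input, 2) {
		// Each call would hit a different instance in a real deployment;
		// the shard outputs are then concatenated into the final response.
		result = append(result, shardForward(weightShards[i], piece)...)
	}
	fmt.Println(result)
}
```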
The last one is quantization. This helps lower the size of the model by compressing its weights. Say you convert the weights of the model from 32-bit floating point to 16-bit floating point: you reduce the size of the model and also the memory required for inference.
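Go has no 16-bit float type, so this sketch illustrates the same idea with symmetric 8-bit integer quantization instead: weights shrink from 4 bytes to 1 byte each, at the cost of a small rounding error.

```go
package main

import (
	"fmt"
	"math"
)

// quantize maps float32 weights to int8 using a single symmetric scale,
// shrinking storage from 4 bytes to 1 byte per weight.
func quantize(w []float32) ([]int8, float32) {
	var maxAbs float32
	for _, v := range w {
		if a := float32(math.Abs(float64(v))); a > maxAbs {
			maxAbs = a
		}
	}
	scale := maxAbs / 127
	q := make([]int8, len(w))
	for i, v := range w {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return q, scale
}

// dequantize recovers approximate float32 weights for inference.
func dequantize(q []int8, scale float32) []float32 {
	w := make([]float32, len(q))
	for i, v := range q {
		w[i] = float32(v) * scale
	}
	return w
}

func main() {
	weights := []float32{0.12, -0.5, 0.33, 0.98}
	q, scale := quantize(weights)
	// The recovered weights are close to the originals at a quarter of the size.
	fmt.Println(q, dequantize(q, scale))
}
```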
So let's talk about how Golang can be helpful in ML pipelines.
So before we get into that, some overview about Go.
It's an open source programming language developed by Google, and the best part about it is its speed: because it compiles to machine code, its performance is on the order of C and C++. It's really fast compared to programming languages like Java, which have a lot of overhead because of the JVM, the Java Virtual Machine.
It's very cloud native by design, because it seamlessly integrates with Docker and Kubernetes. It's really useful for building containerized APIs or services, basically microservices. And it offers high concurrency with built-in goroutines and channels, which enable multitasking and help build high-throughput pipelines.
So specifically when it comes to ML pipelines, the biggest place where Go could be useful is building high-throughput streaming pipelines, where you have incoming data in a streaming fashion, with millions of records coming in every minute or every hour. You use a sort of Kafka-plus-Go-consumer setup to react to these incoming signals and invoke machine learning models. You could use a gRPC client to make remote procedure calls from Go to a PyTorch or TensorFlow backend. It really helps convert a lot of asynchronous applications into synchronous or real-time processing applications because of its throughput and low latency capabilities.
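That consumer-plus-worker-pool shape can be sketched as follows; the `score` function here is just a stand-in for the gRPC call to the model backend, and the channel stands in for the Kafka topic:

```go
package main

import (
	"fmt"
	"sync"
)

// score stands in for the remote model invocation; in the setup described
// above it would be a gRPC request to a PyTorch or TensorFlow backend.
func score(record string) int {
	return len(record)
}

// consume fans incoming records out to nWorkers goroutines, the same shape
// a Kafka consumer group feeding an ML model would take.
func consume(records <-chan string, nWorkers int) <-chan int {
	results := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range records {
				results <- score(r) // each worker invokes the model independently
			}
		}()
	}
	go func() {
		wg.Wait() // close the results channel once every worker has drained the stream
		close(results)
	}()
	return results
}

func main() {
	records := make(chan string, 4)
	for _, r := range []string{"ad", "page", "listing", "post"} {
		records <- r
	}
	close(records)
	total := 0
	for s := range consume(records, 3) {
		total += s
	}
	fmt.Println(total) // sum of scores over all records
}
```

Because the workers share one input channel, throughput scales by just raising `nWorkers` until the model backend, not the consumer, becomes the bottleneck.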
And it's really easy to deploy these services in the cloud because of Go's integration with Docker: you can containerize these web services and deploy them on container platforms such as ECS, provided by AWS, and so on.
Looking ahead, there's a lot more to be achieved in this space of IP protection.
There's a lot more automation that is required, and for which there is scope: machine learning is able to identify most of the patterns, but the novel approaches taken by bad actors are still being identified by humans. There's research being done on how to use blockchain-based ledgers for verification and for determining ownership of IP assets. And over time we are seeing adaptive AI systems, where we use a lot of these LLMs to make the systems learn on their own and identify the new patterns that bad actors develop over time, without a human having to tell them what to do or pass in training information.
Thank you so much for your time.
Hope this was an informative session.