Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
I'm Hitesh.
Today I'm going to talk on the topic of video compression using an AI-based approach. My topic is optimal video compression using the pixel shift tracking method.
So before getting into the topic I would like to give a
quick introduction about myself.
I'm Hitesh.
I've been working as a senior machine learning engineer for Expedia Group.
I have over seven years of experience in the field of machine learning and AI.
I have worked in several industries, like InsurTech and FinTech, and now I'm working in the travel industry.
I did my master's in machine learning and statistics, and since then it's been quite a journey in the field of AI.
And we recently worked on this research idea on video compression, and that's
what I'm going to talk about today.
Let's get into the topic.
So just to give a quick introduction: in today's world, video traffic comprises almost 85 percent of all internet traffic. From social media alone, around 10 petabytes of data get processed and stored into cloud environments every day.
There are around 20 different video compression methods that have been used and are still being used in the industry, and most of them are rule-based algorithmic approaches. In recent times there has been a lot of research happening in the machine learning field, on the computer vision side, on trying to use AI-based approaches to compress videos.
And the main purpose of the research happening in the ML space for video compression is that, using ML, it can work across diverse video formats, irrespective of which format we are trying to use. It can also be adopted in any ML framework, rather than being possible in just one framework. It is framework-independent and also diverse in video formats.
We are proposing another approach over here, and that's what I'm gonna talk about.
So before getting into more detail about our idea, I would just like to give a quick idea of what types of video compression are done today, and then we can get into the topic of our approach, the pixel shift tracking method.
So in the current traditional compression algorithms, these are the formats we have been using: H.264, H.265, AV1, VP8. Some of these could be familiar to you and some may not, but all of these traditional algorithmic methods are being used to store MPEG-4 files, MP4s, MP3s for audio, and MPEG files for video. These algorithms are basically what is being used at the backend for the compression of the videos.
So there are basically two types of compression. One is lossless compression and the other is lossy compression.
Lossy compression is the most used format: when we try to compress data, or write and store data in terms of video, we do see some decrease in the quality of the video. That's lossy compression, unlike lossless compression, which is more for the high definition videos we talk about.
But most of the videos you write or upload to any social media or any cloud storage are compressed, and that's a lossy approach. The approach we are going to talk about today is also a lossy approach.
Lossy doesn't basically mean the data will be lost; rather, it's just a decrease in the quality of the video. From a normal human viewpoint you wouldn't see a lot of data loss, but granularly, if you look closely, there is a lot of data loss happening. That's the only thing we lose, and eventually we compress these videos and store them in a more optimal way. So that's the lossy approach, and the approach we are proposing today is also a lossy-based approach.
Today in the AI and machine learning space, these are some of the areas where research is happening on video compression: autoencoders, VAEs, which are variational autoencoders, and deep contextual networks. Our idea is also based on a deep contextual network over here, and that's what I'm going to talk about today: our approach, which is the pixel point tracking approach, to do video compression.
Yeah, so this is our proposed method. The way we have approached this problem is basically trying to avoid the redundant pixels in a video while storing it.
So just to give a quick example: videos are basically comprised of frames. For any given video, let's say even a five-second video, there would be at least 20 to 30 frames, which are basically the images.
When we watch it as a video, we see a lot of things moving here and there. But if you look at it frame by frame, image by image, we see there's a lot of redundancy from one frame to another, which is basically storing the same pixels or the same data from one frame to the next. That's what we are trying to avoid: not storing those redundant pixels from one frame to another across all the subsequent frames. We reduce all that redundant data and try to optimize the storage using a machine learning based approach.
The way we are trying to find these redundant pixels is basically by tracking the pixels. For example, when a video is taken, you know that the video moves in a particular direction, right? It can move from left to right, or top to bottom, or diagonally from one corner to another; the video can move any way.
So based on the movement of the video, using the coordinates, we are trying to understand how far it has moved from point A to point B. Based on that, we know how many new pixels we are going to get in the next frame, and which other pixels in the next frame are already available from the first frame. We nullify those, putting black spots on them, and by doing so we reduce the size of a frame, or the size of an image, in the video.
So there are two different approaches. One is the single-point trajectory method and the other is the multi-point trajectory method. The single-point trajectory method basically tries to find an arbitrary point and track it. Multi-point basically uses multiple points to track the movement of the video, or of the frames.
So I'll get into the details now, starting with the single-point trajectory.
For a single-point trajectory, what we basically do is try to find an arbitrary point in a video, in the starting first frame. From there we basically track that point from one frame to another. And this is where we are using a deep learning, computer vision approach: a concept called Persistent Independent Particles, which is called PIPs. It's a research paper written by other researchers; we made use of it over here, adapting the PIPs concept in a way, and we are trying to achieve video compression with it. That's our approach basically.
So we use a single-point trajectory over here, which basically means finding an arbitrary point in the first frame of a video. From that we track where that particular object, or point, or pixel is moving from one frame to the next, frame after frame. Based on that, we know how many coordinates it has moved, and based on that we can see how many new pixels we are going to get. That's what we store; we try to avoid the rest.
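To make that concrete, here is a minimal sketch of single-point shift tracking. The `track_point` helper is a hypothetical stand-in for the PIPs model: `model.predict` is an assumed interface, not the real PIPs API, which processes a window of frames at once.

```python
import numpy as np

# Hypothetical wrapper around a PIPs-style point tracker; `model.predict`
# is an assumed interface, not the actual PIPs signature.
def track_point(model, frame_a, frame_b, point):
    """Return the tracked (x, y) position of `point` in frame_b."""
    trajectory = model.predict(np.stack([frame_a, frame_b]), query=point)
    x, y = np.round(trajectory[-1]).astype(int)
    return int(x), int(y)

def per_frame_shifts(model, frames, start_point):
    """Track one arbitrary point frame by frame; return one (dx, dy) per step."""
    shifts, point = [], start_point
    for prev, curr in zip(frames, frames[1:]):
        new_point = track_point(model, prev, curr, point)
        shifts.append((new_point[0] - point[0], new_point[1] - point[1]))
        point = new_point
    return shifts
```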
The single-point trajectory basically works only in places where the objects are static and the camera is moving. There are three different ways a video can behave. One is where the objects are static and the video, or the camera, is moving; another is where the camera is static and the objects are moving; and it could also be both, where the camera and the objects are moving at the same time in a video.
So there are three different scenarios. Single-point tracking can solve only the scenario where the objects are static but the camera is moving, whereas the multi-point trajectory can do way better by having multiple pixel point tracking trajectories. Based on those, we can get an average movement and try to avoid the redundant pixels.
So before getting into the multi-point trajectory, I will just show you a quick example of what this tracking looks like. You can see here in this picture there is a dog running. This is a video where we are trying to track the nose of the dog. The idea is not to track an object here; the idea is basically to track a particular pixel, and based on the pixel movement, we can identify what is the redundant and the non-redundant part of it.
As I mentioned, this single-point trajectory is achievable, I would say, through a method for cases where the objects are static and the camera is moving.
So that's what the single-point tracking looks like when we track it using machine learning, using the frames to see where the object, the pixel point, is moving.
Now, to talk about the multi-point trajectory: as I mentioned, the multi-point trajectory basically works when you put the pixel points, the trajectory points, in multiple places, in a 2D grid format. By doing so, in a given frame, let's say we have an eight by eight grid where you put 64 trajectory points, which can track all 64 positions within the grid, each one at the midpoint of a small grid cell within the eight by eight layout over the image.
By doing so, it can track object movement and it can also track camera movement. By calibrating all of those, we can find the non-redundant part, which is what we can expect to be new in the next frame compared to the first frame. This mostly works as a more advanced approach where the video is very complex, and that's where the multi-point trajectory will be helpful. To just give a quick visual of what the multi-point will look like:
In the previous video we saw that there was a dog, the dog was moving, and we were tracking just one point. I can go back here. You can see here it's tracking one trajectory, and this trajectory looks like it's tracing some kind of leaf, right? That's how it's moving.
But in this case, in this video, we can see that the object as well as the camera are both moving. A single-point trajectory could not effectively tell how the pixels are moving. But by using a multi-point trajectory, you can see how the entire video moves from one frame to another. Let's say it's a two-second video, so maybe we have four to five frames in it; we can see from frame one to frame five how far it has shifted.
So this can give an overview of how the pixels are moving at different positions. Based on that, for every single position we can see which pixels we need to store, which coordinates we need to store, and which position coordinates we can avoid storing and instead retrieve from the previous frame. That's what we are trying to do over here.
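As a rough illustration of the grid idea, here is a sketch that lays trajectory points at the midpoints of an 8x8 grid and averages their motion between two frames. It reuses the hypothetical `track_point` helper from the earlier sketch, so it's an assumption-laden illustration rather than the exact method.

```python
import numpy as np

def grid_points(height, width, n=8):
    """Midpoints of an n x n grid of cells covering the frame."""
    ys = (np.arange(n) + 0.5) * height / n
    xs = (np.arange(n) + 0.5) * width / n
    return [(int(x), int(y)) for y in ys for x in xs]

def average_shift(model, frame_a, frame_b):
    """Track all 64 grid points between two frames and average the motion."""
    shifts = []
    for p in grid_points(*frame_a.shape[:2]):
        q = track_point(model, frame_a, frame_b, p)  # hypothetical helper above
        shifts.append((q[0] - p[0], q[1] - p[1]))
    dx, dy = np.mean(shifts, axis=0)
    return float(dx), float(dy)
```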
So let's get into the steps on how we have achieved it.
For the compression step, just to give a quick note: in this approach we are trying to prove that pixel point tracking is possible using the single-trajectory method. We have used it as a proof of concept over here to display and show that this is possible.
So as step one, using the single-point trajectory method, we arbitrarily choose a point in a video. We used a video where the objects are static but the camera is moving. We use, let's say, the midpoint in the frame, and we track that point over here.
So we choose an arbitrary point for PIPs, the Persistent Independent Particles model, to track. It basically processes eight frames at a time, but to improve the accuracy we are using one frame at a time to predict which way the pixels are moving, that is, where that point moves from the first frame to the second frame for the given pixel point.
So first we place a random, arbitrary point in the first frame, and then we track it frame by frame, from one frame to another.
So let's say we were, we had a, we were at a point in frame one, and eventually
when we try to move from frame one to frame one, to frame two, let's say mode
four coordinates to the right, we know that there are, there is going to be.
A with of four pixels.
Four with of four size pixels going to come inside.
And I know the rest of that part on the left side of the first frame is gonna
be redundant and should be available.
Same pixels in the second frame.
So that's what we are going to nullify and direct delete that
and store only the non pixels.
And that's what the compression part does for one frame to the second frame.
And similarly, from second frame to the third frame to so on.
That's how we try to compress these.
Just to give a quick idea of what that looks like: you can see here in frame one, this is the complete image, and in frame two, what's happening is it has moved a few pixels to the right. So when we do the compression, the second frame won't look like the second picture we see over here; rather, it would look like the third picture in the image, where it basically nullifies all of the redundant pixels and stores only the non-redundant pixels in storage for the second frame. Similarly, we do it for the third frame, the fourth frame, and so on. By doing so, as we are putting a black value of zero, in terms of pixels, on all the redundant parts, we reduce the size during the compression. This is how the compression part works.
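Here is a minimal sketch of that nullification step, assuming the camera shift (dx, dy) between the two frames has already been estimated by the tracker. For simplicity this sketch keeps the shift alongside the stored coordinates, whereas the talk describes recovering it by inverting the coordinate positions.

```python
import numpy as np

def compress_frame(curr, dx, dy):
    """Zero out the part of `curr` already visible in the previous frame.

    (dx, dy) is the estimated camera shift: a pixel at (x, y) in the
    previous frame appears at (x - dx, y - dy) in `curr`. Returns the
    sparse frame plus the redundant rectangle's four coordinates, which
    go into the side array used later for decompression.
    """
    h, w = curr.shape[:2]
    y0, y1 = max(0, -dy), min(h, h - dy)   # rows shared with previous frame
    x0, x1 = max(0, -dx), min(w, w - dx)   # columns shared with previous frame
    sparse = curr.copy()
    sparse[y0:y1, x0:x1] = 0               # nullify redundant pixels (black)
    return sparse, (x0, y0, x1, y1)
```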
Now I'll get into how the decompression part works. The decompression part is an interesting one; it's a very intuitive approach. In the process of this compression, the first frame always remains intact. That basically means whatever it is, the first frame remains the same, because for the first frame there is no such thing as a redundant pixel. It's the first frame, so it always stores the entire data, with all of its pixel data points.
Now, what happens during the decompression step over here: during the compression, we know that from frame one to frame two the video moved, say, four coordinates to the right on the X axis. That is the only new width we are looking for from the second frame, so that's what we stored from the second frame, and we nullified the rest. Those nullified positions are stored as an array.
We collect those coordinate positions alone, which are basically the X-axis and Y-axis points of, let's say, a rectangle. So it takes four coordinate points, and we store them per frame to know what has to be recomposed when we do the decompression. Basically, when we do the compression, we store all the frames by storing only the non-redundant pixels, and then we also store a separate array which holds the coordinates of the redundant positions, which can be obtained from the previous frame.
How that would basically work: if you see here in the first image, the data retrieval frame, this is the first frame, and we go from the first frame to the second frame. In the second one, wherever it's shaded, those are the non-redundant pixels; wherever it's blank, that's the redundant part. What it's basically showing us is that from the first frame to the second frame, the camera has moved like this: it has moved from this position, tilted a little bit toward the downside. Those are the new pixels just coming inside the second frame. So that's what we are storing, and we remove the rest of it.
To reconstruct the second frame with all the pixels, I mean with all the remaining redundant parts, to make it a video at the end, what we do is take the redundant part from the first frame. We know the first frame is completely intact, with all its pixels. From the first frame we take the redundant part, which is basically at the inverse position of the second frame's redundant position: the second frame's redundant coordinates will be the reverse of the first frame's coordinate positions, because that's the way the video has moved. So we take the redundant part of the pixels and try to reconstruct the second frame.
So the second frame is reconstructed. Similarly, we do the same for the subsequent frames, frame two to frame three and so on until the last frame, and by doing so we do the decompression. We bring back everything again. That's how we are approaching this and that's how we do the decompression.
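A matching sketch of the reconstruction, under the same assumptions as the compression sketch above (the shift and the redundant-region coordinates were saved per frame during compression):

```python
import numpy as np

def decompress_frame(prev, sparse, box, dx, dy):
    """Rebuild a frame from the previous, already reconstructed frame.

    `sparse` carries only the new pixels; the redundant region `box`
    is copied back from the matching, shifted region of `prev`.
    """
    x0, y0, x1, y1 = box
    frame = sparse.copy()
    frame[y0:y1, x0:x1] = prev[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    return frame

# Frame one stays intact, so reconstruction chains forward from it:
#   frames = [frame_one]
#   for sparse, box, (dx, dy) in stored:
#       frames.append(decompress_frame(frames[-1], sparse, box, dx, dy))
```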
And this is basically the work we have done: the POC on the single-point tracking.
So in terms of the results over here: what we were able to achieve is that the compression takes around 15 milliseconds per frame, and it results in a size of 36 kilobytes per compressed frame. So let's say we have a thousand frames; it would basically be 15,000 milliseconds.
Similarly for decompression, it's taking around 15 milliseconds over here. And as we are decompressing, we need to reconstruct the frame, so that adds some extra time per frame, and each reconstructed frame comes out to around 238 kilobytes over here.
There could be a question of why the compression and decompression take 15 milliseconds each. It's basically the model: when we are doing the compression, the model has to do the prediction of where the pixel is moving from one frame to the next, so the model takes a few milliseconds. And we also have a built-in algorithm on top of it, which actually uses those coordinates to nullify the redundant pixels, store those pixel coordinates into an array, and store only the non-redundant pixels from the subsequent frames, and so on. That's why it takes around 15 milliseconds per frame.
And eventually, for a two-minute video, it currently takes close to a minute to do the compression and store it, and the decompression takes pretty much close to two minutes. So it's a little slow. The reason is that we are doing the prediction per frame.
But we can also try to do the predictions per eight frames. Let's say we have 64 frames in the video; we can just do it in eight iterations by predicting eight frames at a time and storing them. But eventually what happened was the loss was higher: when we tried eight frames at a time, the prediction quality was dropping and the data loss was increasing. We had to use a single frame at a time so that the accuracy is really good and the loss is low, so that we can reduce the data losses and benchmark it against the existing traditional methods.
And you can see that in this graph, right? As we do the compression per eight frames, or per seven frames, or per six frames, you can see the reduction in size. When we do per eight frames, we are able to reduce the size by around 82% compared to using one frame for predictions; by doing so we can reduce a lot of the size, and it can also be faster. But the problem, the reason it is able to reduce the size that much, is that increasing the number of frames used at a time for prediction also adds a lot of data loss. That's why we see the reduction in the total size when we do the compression. And you can see it in the right-hand side graph: as we increase the number of frames used at a time to do the prediction for all the pixel movements, say eight frames, it just takes the arbitrary point and tries to see where it is moving across those eight frames, so the accuracy goes down. Because of that, the way it identifies the redundant pixels also goes wrong, and it eventually ends up causing a lot of data loss.
Because of that, we can see that the compression percentage is pretty high when we try this number of frames, but the loss is also high. In the right-hand side graph, you can see the loss going pretty high when we increase the prediction to eight frames at a time.
If you do one frame, you can see the loss is very low, close to 4%, whereas if you use eight frames with PIPs++ it's around seven percent of data loss. So it's pretty high; eventually you can see that the quality would not be very good, actually.
And you can see the reason there are two lines over here: one is PIPs and the other is PIPs++. They are two different models. PIPs is the initial version, versus PIPs++, the version two of it, which does better predictions on tracking the trajectories of the pixel point of the video, which is what we are using.
That's why we tried both models to evaluate the performance, and eventually, in both models, using a single frame at a time to do the prediction does better. A single-frame prediction is, in essence, using two frames at a time: it can predict from frame number one to frame number two how and where the pixel has actually shifted.
And that's what the performance graph looks like. We were able to achieve better performance by using single-frame predictions in the model.
So the loss was very much close to 4%, which means we were able to retain 96 percent of the data. And the loss is basically not a visible loss where you will see black dots here and there; no, it's not visible like that, and you might still see the video working fine. But in terms of quality, there's a 4% reduction. Still, we were able to do the compression way better, reducing the storage size by around 84%. That's something really great: around 84% with just 4% loss in data. That's a good tradeoff over here.
And this is more of an initial approach of using the redundancy concept over here to do the storage using an ML-based approach.
So, as further approaches and ideas for viewers who are listening to this: we can try to use this with multi-point trajectories. That's what we are trying to work on for our next research paper, where we are trying to improve this performance for more complex videos by putting the trajectories in multiple directions.
And another approach is basically object detection masking. This basically works in places where, let's say in a given frame, there is a human, a dog, or any kind of object. It can mask all those objects and understand the pixels in a much smoother way. It can put a mask on top of an object, try to identify the same mask or the same person in the second frame, and eventually avoid those redundant pixels in the subsequent frames by masking those objects. And then the other approach is similarity search.
Similarity search metrics can basically be used where, if you see any similar pixels already available in the previous frame compared to the new frame, at every pixel level, using similarity search metrics, any kind of cosine similarity or any kind of dot product similarity, we can see how close these pixels are and choose not to store those pixels in the subsequent frames, so that we can map and reuse these pixels in all those places when we do the decompression.
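As a rough sketch of that idea, assuming fixed-size patches compared at the same location in consecutive frames (the actual metric and granularity are open choices):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened pixel patches."""
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 1.0

def redundant_patches(prev, curr, patch=8, threshold=0.99):
    """Yield top-left (x, y) of patches in `curr` that nearly duplicate
    the same location in `prev` and therefore need not be stored."""
    h, w = curr.shape[:2]
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if cosine_similarity(prev[y:y + patch, x:x + patch],
                                 curr[y:y + patch, x:x + patch]) >= threshold:
                yield (x, y)
```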
So these are some of the approaches we can try, and it's open for anyone to give it a try and see if we can come up with better approaches or better solutions.
Yeah, these are the future approaches, and I'm hoping machine learning and these AI models not only use a lot of data to train themselves, but also give us scope to use AI to reduce data storage, because a lot of data storage in today's world is being used to train these machine learning models. In return, I hope these machine learning models can also contribute in a way that stores things optimally and reduces cost for us.
Yeah, so that's a good takeaway out of this talk: AI not only uses a lot of data, but it can also help us optimize the usage of data and the storage of the data, and this is one such approach which we tried. That's all.
And these are some of the references: the research paper, which you can look into, and the repos there. And these are some of the other references which we looked into to take inspiration from and to work on these optimization approaches using deep learning methods.
We especially used the deep convolutional network methods over here. And that's what this talk is all about.
I hope you all enjoyed my talk, and feel free to reach out to me if you have any questions. I would love to answer them. Thank you.