Conf42 Machine Learning 2025 - Online

- premiere 5PM GMT

Optimal Video Compression Using Pixel Shift Tracking


Abstract

Revolutionize video compression in my talk! Video drives 85% of internet traffic, yet traditional methods hit limits. I'll introduce R2S (Redundancy Removal using Shift), an ML-powered approach that slashes frame redundancy, competes with legacy codecs, and optimizes storage. Discover its adaptability now.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I'm Hitesh. Today I'm going to talk about video compression using an AI-based approach, and my topic is optimal video compression using a pixel shift tracking method. Before getting into the topic, I'd like to give a quick introduction about myself. I've been working as a senior machine learning engineer at Expedia Group. I have over seven years of experience in the field of machine learning and AI, and I've worked in several industries, like InsurTech and FinTech, and now I work in the travel industry. I did my master's in machine learning and statistics, and it's been quite a journey in the field of AI since then. We recently worked on this research idea on video compression, and that's what I'm going to talk about today.

Let's get into the topic. To give a quick introduction: in today's world, video makes up almost 85 percent of internet traffic. From social media alone, around 10 petabytes of data get processed and stored in cloud environments every day. There are around 20 different video compression methods that have been and are still being used in the industry, and most of them are rule-based algorithmic approaches. In recent times there has been a lot of research in the machine learning and computer vision space on using AI-based approaches to compress video. The main motivation for that research is that an ML-based method can work across diverse video formats, irrespective of which format we use, and it can be implemented in any ML framework rather than being tied to one. We are proposing another approach here, and that's what I'm going to talk about.

Before getting into more detail about our idea, I'd like to give a quick overview of how video compression is done today, and then we can get into our pixel shift tracking approach. In current traditional compression, these are the codecs we have been using: H.264, H.265, AV1, VP8. Some of these may be familiar. All of these traditional algorithms sit behind the formats we store, such as MPEG-4/MP4 files for video and MP3 for audio, and they handle the compression of the video at the back end. There are basically two types of compression: lossless compression and lossy compression. Lossy compression is the most commonly used: when we compress and store a video, we see some decrease in the quality of the video, and that is lossy compression. Lossless compression is used more for the high-definition videos we talk about. But most of the videos you upload to any social media or cloud storage are compressed with a lossy approach, and the approach we are going to talk about today is also a lossy approach. Lossy doesn't mean the data is simply lost; rather, it's a decrease in the quality of the video.
From a normal human standpoint, you wouldn't notice much of the data loss, but if you look at it granularly, there is data being lost. That's the only thing we give up, and in exchange we compress these videos and store them in a more optimal way. That's the lossy approach, and the approach we are proposing today is also a lossy approach. In the AI and machine learning space today, these are some of the areas where research on video compression is happening: autoencoders, VAEs (variational autoencoders), and deep contextual networks. Our idea is also based on a deep contextual network, and that's what I'm going to talk about today: the pixel point tracking approach to video compression and decompression.

So this is the proposed method. The way we approach this problem is by trying to avoid storing the redundant pixels in a video. To give a quick example: videos are basically made of frames. For any given video, even a five-second one, there will be at least 20 to 30 frames, which are basically images. When we watch it as a video we see a lot of things moving around, but if you look at it frame by frame, image by image, you see a lot of redundancy from one frame to the next: the same pixels, the same data, stored again in each subsequent frame. That's what we're trying to avoid: we don't store those redundant pixels in the subsequent frames, we remove that redundant data, and we optimize the storage using a machine learning based approach.

The way we find these redundant pixels is by tracking pixels. When a video is shot, the camera moves in a particular direction: left to right, top to bottom, any direction. Based on the movement of the video, using coordinates, we try to understand how far it has moved from point A to point B. From that we know how many new pixels we are going to get in the next frame, and which other pixels in the next frame are already present in the first frame; we nullify those by putting black pixels in their positions, which reduces the stored size of each frame. There are two different approaches: the single point trajectory method and the multi-point trajectory method. The single point trajectory method picks an arbitrary point and tracks it; the multi-point method uses multiple points to track the movement of the video or the frames. I'll get into the details now, starting with the single point trajectory. For a single point trajectory, we pick an arbitrary point in the first frame of the video, and from there we track that point from one frame to the next. This is where we use a deep learning, computer vision approach.
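To make the shift idea concrete, here is a minimal illustrative sketch (not the code from the talk) of how a single tracked point gives the frame-to-frame shift and, from that, how wide the strip of genuinely new pixels is. The point coordinates are made-up values.

```python
# Illustrative only: derive the frame-to-frame shift from one tracked pixel.
prev_pt = (120.0, 64.0)   # (x, y) of the tracked pixel in frame t-1 (made-up value)
curr_pt = (124.0, 64.0)   # same pixel as located by the tracker in frame t

dx = round(curr_pt[0] - prev_pt[0])   # +4 -> content shifted 4 pixels horizontally
dy = round(curr_pt[1] - prev_pt[1])   #  0 -> no vertical movement

# Only a |dx|-pixel-wide column strip (and a |dy|-pixel-tall row strip) of the
# new frame is genuinely new; every other pixel already exists in the previous
# frame and does not need to be stored again.
```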
We're using a concept called Persistent Independent Particles, or PIPs. It's a method from a research paper by other researchers, and we made use of it here, adapting the PIPs concept to achieve video compression. That's our approach, basically. With a single point trajectory, we find an arbitrary point in the first frame of a video, and from there we track where that particular point or pixel moves from one frame to the next, and to the subsequent frames. Based on that, we know how many coordinates it has moved, and from that we can tell how many new pixels are coming in; that's what we store, and we avoid storing the rest.

The single point trajectory only works in cases where the objects are static and the camera is moving. There are three ways a video can behave: the objects are static and the camera is moving; the camera is static and the objects are moving; or both the camera and the objects are moving at the same time. Single point tracking can handle only the scenario where the objects are static and the camera is moving, but the multi-point trajectory can do much better by tracking multiple pixel points, averaging their movement, and using that to identify the redundant pixels.

Before getting into the multi-point trajectory, I'll show a quick example of how this tracking looks. You can see in this picture a dog running; this is a video where we are tracking the nose of the dog. The idea is not to track an object; the idea is to track a particular pixel, and based on the pixel movement we can identify the redundant and non-redundant parts. As I mentioned, the single point trajectory is achievable when the objects are static and the camera is moving. That's what single point tracking looks like when we track it with machine learning and use the frames to see where the pixel point is moving.

To talk about the multi-point trajectory: as I mentioned, it works by placing the trajectory points in multiple places in a 2D grid format. In a given frame, let's say we have an 8x8 grid where we put 64 trajectory points, which can track 64 positions within the grid, for example the midpoint of every small cell of the grid. By doing that, we can track object movement as well as camera movement, and by calibrating all of those we can find which part is non-redundant, which is what we expect to be new in the next frame compared to the first frame. This mostly helps in more advanced cases where the video is complex, and that's where the multi-point trajectory will be helpful. Let me give a quick visual of what the multi-point version looks like.
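As a rough sketch of the multi-point idea, the snippet below seeds an 8x8 grid of points and averages their per-step displacement. Here `track_fn` is a hypothetical stand-in for whatever point tracker is used; the actual PIPs model from the paper has its own PyTorch interface, which is not shown.

```python
import numpy as np

def grid_points(height, width, n=8):
    """Midpoints of an n x n grid over a height x width frame, as (x, y) pairs."""
    ys = (np.arange(n) + 0.5) * height / n
    xs = (np.arange(n) + 0.5) * width / n
    return np.array([(x, y) for y in ys for x in xs])

def average_shift(track_fn, frames, points):
    """track_fn(frames, points) -> positions of each point in every frame, shape (T, N, 2).
    Averaging per-point motion gives one (dx, dy) estimate per frame step, which
    smooths independent object motion against the overall camera motion."""
    trajectories = track_fn(frames, points)   # placeholder for the real tracker call
    deltas = np.diff(trajectories, axis=0)    # (T-1, N, 2): per-point frame-to-frame motion
    return deltas.mean(axis=1)                # (T-1, 2): average shift per frame step
```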
In the previous video we saw the dog moving and we were tracking just one point; if I go back here, you can see it's tracking one trajectory, and the trajectory traces out something like a leaf shape. That's how it moves. But in this next video, both the object and the camera are moving, so a single point trajectory can't effectively tell us how the pixels are moving. By using a multi-point trajectory, you can see how the entire frame shifts from one frame to another. Let's say it's a two-second video, so maybe we have four to five frames in it; we can see from frame one to frame five how far it has shifted. This gives an overview of how the pixels are moving at different positions, and based on every single position we can see which pixels and which coordinates we need to store, and which position coordinates we can avoid storing and instead retrieve from the previous frame. That's what we are trying to do here.

So let's get into the steps of how we achieved it. For the compression step, a quick note: in this approach we are trying to prove that pixel point tracking is possible using the single trajectory method, so we used it as a proof of concept to show that this works. As step one, using the single point trajectory method, we arbitrarily choose a point in the video. We used a video where the objects are static but the camera is moving. We pick, let's say, the midpoint of the frame, and we track that point. We choose an arbitrary point for the PIPs model, the Persistent Independent Particles model, to track. It normally processes eight frames at a time, but to improve accuracy we use one frame at a time to predict where that pixel point is moving from the first frame to the second frame.

So first we place an arbitrary point in the first frame, and then we track it frame by frame. Let's say we were at a point in frame one and, moving from frame one to frame two, it moved four coordinates to the right. We then know there is going to be a strip of new pixels four pixels wide coming in on one side, and the rest of the frame is going to be redundant: the same pixels are already available from the first frame. That's what we nullify and delete, and we store only the non-redundant pixels. That's what the compression does from the first frame to the second frame, and similarly from the second frame to the third frame, and so on. That's how we compress.

To give a quick idea of how that looks: you can see here that frame one is the complete image, and in frame two the content has moved a few pixels to the right.
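A minimal sketch of that compression step, assuming the frame-to-frame motion is a pure translation by (dy, dx) pixels. This is a simplified stand-in for the talk's implementation, which stores the redundant region's rectangle coordinates alongside each frame.

```python
import numpy as np

def compress_frame(curr, dy, dx):
    """Black out the pixels of `curr` that are redundant with the previous frame,
    assuming pure translation: curr[y, x] == prev[y - dy, x - dx] where valid."""
    h, w = curr.shape[:2]
    rows_redundant = (np.arange(h) - dy >= 0) & (np.arange(h) - dy < h)
    cols_redundant = (np.arange(w) - dx >= 0) & (np.arange(w) - dx < w)
    mask = rows_redundant[:, None] & cols_redundant[None, :]   # True where redundant

    compressed = curr.copy()
    compressed[mask] = 0    # store black (zero) in place of the redundant pixels
    return compressed, mask # the mask / its bounding rectangle is kept for decompression
```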
So when we do the compression, the stored second frame won't look like the second picture you see here; rather, it looks like the third picture, where all the redundant pixels are removed and only the non-redundant pixels are stored for the second frame. Similarly we do it for the third frame, the fourth frame, and so on. Because we put a black value of zero for all the redundant pixels, we reduce the size during compression. That's how the compression part works; now I'll get into how the decompression part works.

The decompression part is an interesting one, and it's a very intuitive approach. In this compression process, the first frame always remains intact. That means whatever the first frame is, it stays the same: for the first frame there is no such thing as a redundant pixel, so the first frame always stores the entire data, with all of its pixel values.

Now, what happens during the decompression step: during compression we know that from frame one to frame two the content moved, say, four coordinates to the right on the x-axis. That strip is the only new width we need from the second frame, so that's what we stored from the second frame, and we nullified the rest. Those nullified positions are stored as an array, collecting just the coordinate positions: the x-axis and y-axis points of, say, a rectangle. It takes four coordinate points, and we store them per frame so we know what has to be recomposed when we do the decompression. So when we compress, we store all the frames with only the non-redundant pixels, and we also store a separate array which holds the coordinates of the redundant positions that can be obtained from the previous frame.

Here's how that works. If you look at the first image here, the data retrieval frame, this is the transition from the first frame to the second frame. Wherever you see shading, those are the non-redundant pixels; wherever it's blank is the redundant part. What it's showing is that from the first frame to the second frame the camera has moved, tilted slightly downwards, so the shaded pixels are the new pixels coming into the second frame. That's what we store, and we remove the rest. To reconstruct the second frame with all of its remaining redundant pixels, so that it becomes a video again at the end, we take the redundant part from the first frame. We know the first frame is completely intact, with all of its pixels, so from the first frame we take the redundant part, which sits at the inverse position of the second frame's redundant region: the second frame's redundant coordinates are the reverse of the first frame's coordinates, because that's the direction the video moved.
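Following the same simplified translation assumption, decompression can be sketched as below: the redundant region is copied back from the already-reconstructed previous frame, while the stored new pixels are kept as they are.

```python
import numpy as np

def decompress_frame(prev_full, compressed, dy, dx):
    """Rebuild the full current frame from the previous reconstructed frame plus
    the stored non-redundant pixels, assuming a pure (dy, dx) translation."""
    h, w = compressed.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y, src_x = ys - dy, xs - dx                           # where each pixel came from
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)

    restored = compressed.copy()                              # new pixels already in place
    restored[valid] = prev_full[src_y[valid], src_x[valid]]   # fill the redundant region
    return restored
```

Because frame 1 is stored intact, decompression walks forward: frame 2 is rebuilt from frame 1, frame 3 from the rebuilt frame 2, and so on.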
So we take the redundant part of the pixels and reconstruct the second frame; the second frame is now reconstructed. Similarly we do the same for the subsequent frames, frame two, frame three, and so on until the end, and by doing so we do the decompression and bring everything back again. That's how we approach this and how we do the decompression, and this is the work we have done as a POC on single point tracking.

In terms of the results: compression takes about 15 milliseconds per frame, and each compressed frame comes out at around 36 kilobytes. So if we have a thousand frames, that's roughly 15,000 milliseconds. Similarly, decompression takes around 15 milliseconds per frame, and because we need to reconstruct the frame during decompression, that adds some extra time per frame; each reconstructed image is around 238 kilobytes.

There could be a question of why compression and decompression each take about 15 milliseconds. During compression the model has to do predictions: it has to predict where the pixel is moving from one frame to the next, and that takes a few milliseconds. We also have a built-in algorithm on top of it which uses those coordinates to nullify the redundant pixels, store the redundant pixel coordinates into an array, and store only the non-redundant pixels from the subsequent frames, and so on. That's why it takes about 15 milliseconds per frame for each step. For a one-minute video, compression currently takes close to a minute to compress and store, and decompression takes pretty much close to two minutes. So it's a little slow, and the reason is that we do the prediction one frame at a time.

We could also do the prediction per eight frames: say we have 64 frames in the video, we could do it in eight iterations by predicting eight frames at a time. But what happened was the loss was higher: when we predicted eight frames at a time, the prediction accuracy dropped and the data loss increased. So we use a single frame at a time so that the accuracy is good and the loss is low, which reduces the data losses, and we benchmarked this against the existing traditional methods. You can see it in this graph: as we do the compression per eight frames, or per seven frames, or per six frames, the reduction in size goes up. When we do it per eight frames we are able to reduce the size by about 82 percent, which is more than we can reduce using one frame per prediction.
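As a quick back-of-the-envelope check using the per-frame numbers just quoted (36 KB stored versus roughly 238 KB reconstructed, and about 15 ms of model time per frame):

```python
compressed_kb, reconstructed_kb = 36, 238
reduction = 1 - compressed_kb / reconstructed_kb   # ~0.85 -> roughly 85% smaller per frame

ms_per_frame = 15
frames = 1000
total_seconds = frames * ms_per_frame / 1000       # 15 s of prediction time for 1,000 frames
```

That per-frame ratio is consistent with the roughly 80 to 84 percent storage reduction quoted for the single-frame setting below.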
It could also be faster, but the reason it reduces the size so much is that increasing the number of frames used per prediction also adds a lot of loss to the data, and that's why we see the reduction in total size when we compress that way. You can see in the right-hand graph that as we increase the number of frames used per prediction of the pixel movements, the accuracy goes down. If we use eight frames, the tracker takes the arbitrary point and predicts where it moves across those eight frames at once, so the accuracy drops; because of that, the redundancy removal also makes mistakes, and we end up with a lot of data loss. That's why the compression percentage is high when we use more frames per prediction, but the loss is also high. In the right-hand graph you can see the loss going up as we increase the prediction to eight frames at a time. With one frame at a time the loss is very low, close to 4 percent, whereas with eight frames using PIPs++ the data loss is considerably higher. At that point the quality would not be very good.

You can also see there are two lines here, PIPs and PIPs++. These are two different models: PIPs is the initial version, and the second version gives better predictions when tracking the trajectories of the pixel point in the video. That's why we used both models to evaluate the performance, and with both models, using a single frame at a time for the prediction does better, because a single-frame prediction essentially uses two frames at a time: it predicts from frame one to frame two exactly where the pixel has shifted. That's what the performance graph shows, and we were able to achieve better performance using single-frame predictions in the model. The loss was close to 4 percent, so we were able to retain about 96 percent of the data, and the loss is not a visible loss where you would see black dots here and there; the video still looks fine, but in terms of quality there is a 4 percent reduction. Still, we were able to compress much better, reducing the storage size by around 80 to 84 percent with just a 4 percent loss in data. That's a good trade-off. This is more of an initial approach to using this redundancy concept for storage with an ML-based approach.

As further approaches and ideas for viewers listening to this: we can try this with multi-point trajectories. That's what we are working on for our next research paper, where we are trying to improve performance for more complex videos by placing trajectories in multiple positions. Another approach is object detection masking. This works in cases where, say, in a given frame there is a human, a dog, or some other kind of object: we can mask those objects and understand the pixels in a much smoother way.
It can put a mask on top of an object, identify the same mask or the same person in the next frame, and eventually avoid storing those redundant pixels in subsequent frames by masking those objects. Another approach is similarity search metrics: if any pixels in the new frame are similar to pixels already available in the previous frame, we can compare them at the pixel level using similarity metrics such as cosine similarity or dot-product similarity, see how close these pixels are, and avoid storing them in the subsequent frames, then map and reuse those pixels when we do the decompression. These are some of the approaches we can try, and anyone is welcome to give this a try and see if we can come up with better approaches or better solutions.

So those are the future directions. I'm hoping that machine learning and these AI models not only use a lot of data to train themselves, but also give us scope to use AI to reduce data storage, because a lot of storage in today's world goes into training these machine learning models. In return, I hope these models can also contribute by storing things in an optimal way and reducing cost for us. That's a good takeaway from this talk: AI not only uses a lot of data, it can also help us optimize the usage and storage of data, and this is one such approach we tried.

That's all. These are some of the references: the research paper you can look into, along with the repos, and some of the other references that inspired us to work on these optimization approaches using deep learning methods, especially the deep convolutional network methods we used here. That's what this talk is all about. I hope you all enjoyed it, and feel free to reach out to me if you have any questions. I'd love to answer them. Thank you.
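To illustrate the similarity-search idea mentioned in the future directions above, here is a hedged sketch (not part of the talk's implementation): it compares co-located patches of consecutive frames with cosine similarity and flags near-duplicate patches whose coordinates could be stored instead of their pixels. The patch size and threshold are arbitrary illustrative choices.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two equally sized image patches."""
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 1.0

def redundant_patches(prev, curr, patch=16, threshold=0.995):
    """Yield (row, col) of patches in `curr` that are near-duplicates of the same
    patch in `prev`; those patches can be reused at decompression instead of stored."""
    h, w = curr.shape[:2]
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if cosine_similarity(prev[y:y+patch, x:x+patch],
                                 curr[y:y+patch, x:x+patch]) >= threshold:
                yield y, x
```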
...

Hitesh Saai Mananchery

Senior Machine Learning Engineer @ Expedia Group

Hitesh Saai Mananchery's LinkedIn account


