Conf42 Rustlang 2022 - Online

The art of programmatic videos with Rust


Abstract

This is going to be a release of https://fframes.studio. I've spent almost 2 years actually building it and diving deep into video programming. Videos are actually a sequence of images, properly encoded. But how can you make your own video from scratch with only code?

Here is a list of things you will know after attending this talk:

  1. How do videos work under the hood? What are codecs and how do they work?
  2. How to make a video with code?
  3. Why use Rust?
     3.1 Interop with libav (aka FFmpeg)
     3.2 Memory-efficient frame rendering
     3.3 GPU support
  4. The frame rendering problem: browser vs. Rust for SVG rendering
  5. Audio creation, remixing, and blending
  6. GPU for video rendering

And you will also have everything you need to start making videos programmatically with Rust. Gonna be 🔥

Summary

  • Dmitriy: Today, video is the most popular type of content on the Internet. People are using really weird software today to create videos. Rust is probably the only correct way to work with audio or video, he says. Dmitriy explains how you can work with videos in Rust.
  • Today, surprisingly, the browser is still the most popular way to render any kind of static content. I've been trying to find a way to make frame and image rendering more efficient. To do this, we need a format that is fixed-size, easy to animate, and debuggable.
  • You can render SVG in Rust directly without any problems. The most amazing part of this project is a GPU renderer created completely from scratch. It's still not 100% working, but it's amazingly fast.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everybody, my name is Dmitriy, I'm from Ukraine and I'm working at Lightsource AI as a software engineer — as a Rust lead software engineer. A little disclaimer: if you love big money and you are a brilliant Rust engineer, you are very welcome to try out and be a part of Lightsource. But today I'm going to talk about a slightly different passion of mine: multimedia programming, and especially video and audio programming, with Rust. I've been working for one and a half years on a project that allows you to create videos, and this talk is actually my journey of creating videos and audio with Rust. So at the very start I am going to begin from scratch: what is a video? Today, video is the most popular type of content on the Internet. You know that today the Internet consists of, like, 90% videos with kittens. But in fact, people today are spending an enormous amount of time watching videos: by the stats, the average Internet user spends more than 10 hours per week watching videos. Not so much, you say, but that's actually dozens or hundreds of billions of hours of watching videos per week. That's a lot. But the technologies behind videos are completely outdated. I don't even want to talk about Adobe After Effects or professional software written in C that could be replaced with safer Rust. I'm actually talking about what real users are using to create videos for TikTok, for example. By the way, did you notice how we ended up in a world where the vertical video is the correct one? I didn't yet, but people are using really weird software today to create videos. They are using mobile apps, mobile video editors, mobile graphics editors, and those are really, really far away from being efficient. When I first understood that the most popular video editor renders video through the web — through the browser, through HTML and CSS — I was like, oh my God, I need to somehow fix this. And it's actually a great time for Rust, because Rust today is probably the only correct way — I won't be afraid to say this — to work with audio or video. When we start figuring out more deeply what a video is, you will notice that the main problem of video is memory and the garbage collector: handling video or audio at runtime in a garbage-collected language is mostly impossible. But today we will start literally from scratch, and I will try to describe how I did it and how you can work with videos in Rust. So let's start with what a video actually is — what a video file is and how it works. Let's take the most common container for videos: mp4. And mp4 is not a codec, it's only a container for media. It can contain streams for different types of codecs. The mp4 file actually contains an image stream, and each image — you can see it on the slide right now — has a presentation and a decoding timestamp. That's because the codec itself really depends on the ordering of frames: the presentation order can be 0, 1, 2, 3, 4, but the decoding order is different, because the encoder can use information from the next frame in order to decode the current one, as shown in the schema. But you can understand right now that each cell in this graph is actually a separate image, like a PNG or raw RGB — actually, in a different format, but we will talk about that a little bit later.
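To illustrate the two timestamps, here is a small sketch — illustrative only, not fframes or libav code — of how packets carry both orderings:

```rust
/// Illustrative only: a stripped-down video packet, the way libav-style APIs model it.
struct Packet {
    pts: i64, // presentation timestamp: when the frame is shown
    dts: i64, // decoding timestamp: when the frame must be decoded
}

fn main() {
    // With B-frames, decode order differs from presentation order:
    // the frames presented at pts 1 and 2 need the frame at pts 3 decoded first.
    let stream = [
        Packet { pts: 0, dts: 0 }, // I-frame
        Packet { pts: 3, dts: 1 }, // P-frame, decoded early
        Packet { pts: 1, dts: 2 }, // B-frame
        Packet { pts: 2, dts: 3 }, // B-frame
    ];
    // Packets are stored in the container in dts order; the player
    // reorders decoded frames back into pts order for display.
    for p in &stream {
        println!("decode at {}, present at {}", p.dts, p.pts);
    }
}
```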
And we also have audio in the video: there is an audio stream here as well. The audio stream also consists of frames. These frames are a little bit different, because audio consists of samples, and a sample covers a much smaller amount of time — we record around 44,100 or 48,000 samples per second. We also capture frames of audio and connect them with the same architecture of presentation and decoding timestamps. And that's actually the base for all the codecs you can find today on the Internet, in open source, everywhere. Because the codec itself is mostly about the math: it's about how to make the information of a specific image sequence eat less space on your hard drive. There are a lot of codecs today you may know, like MPEG-4, HEVC, AV1, and many others; some support transparent images, some are better for streaming, some better for the web, some better for professional video editing and used mostly by cinema creators. But most of them are actually giant specifications of the mathematics used inside the codec to make the video lighter, in order to more easily transfer it through the net, stream it, and so on and so forth. For example, the second most popular codec today, and probably the one that should be used everywhere, is HEVC — the High Efficiency Video Codec, also known as H.265. Its specification is over 700 pages, and there are plenty of implementations of this codec. You can use the specification to build your own codec, or you can use an open source implementation. What it actually defines is the math of how, for example, when I'm moving somewhere, the video should not save every frame of my movement: it's perfectly enough to capture one position, then another one, and then render something in between. And that's mostly what codecs are doing. But you, as a user or a developer who makes videos with code, likely don't really want to implement a 700-page specification by yourself. And that's where libav helps. Libav is an open source C library for audio and video, and you can use it. It's actually an abstraction over all the codecs you may find on the Internet: it supports literally everything — all the audio codecs, all the video codecs, and so on and so forth. And you may know it by another, more popular name: FFmpeg. Originally FFmpeg was a CLI wrapper around libav; right now they are maintained within one repository and libav is actually a part of FFmpeg, but it doesn't really matter. Using FFmpeg you can render any kind of video from any kind of image and audio source. As we noticed at the very start, literally one command is perfectly enough to generate a video. So basically, say we want a video at 60 FPS in full HD, with an input sequence of n images. On your file system you can have PNG files named "pic" plus the number of the frame, which will then be decoded and transformed into frames using the H.264 or H.265 codec and a specific pixel format, and as an output you'll get the test mp4 file. And this actually works; that's how most videos are made today by the free-to-use editors on mobile and on the web.
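To make that concrete, here is a hedged sketch of the one-command approach, shelling out to the ffmpeg CLI from Rust; the file names and 60 FPS full-HD settings follow the example above:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Feed a numbered PNG sequence (pic1.png, pic2.png, ...) to ffmpeg and
    // encode it as a 60 FPS, full-HD H.264 video in an mp4 container.
    let status = Command::new("ffmpeg")
        .args([
            "-framerate", "60",    // input frame rate
            "-i", "pic%d.png",     // numbered image sequence
            "-c:v", "libx264",     // H.264 encoder
            "-pix_fmt", "yuv420p", // pixel format most players expect
            "-s", "1920x1080",     // full HD
            "test.mp4",
        ])
        .status()?;
    assert!(status.success(), "ffmpeg failed");
    Ok(())
}
```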
But that's really, really far away from being efficient, just because we don't want to waste a lot of time rendering intermediate image files. And we can do it directly: because libav has a public C API, we can use it from Rust to do the encoding manually. How it works is pretty easy. You have an image — say a BMP, a raw image source. Then you need to convert it to a YUV image (we will talk a bit more about that), then you send it to the encoder, whose implementation is provided by libav, and you get a packet. A packet is a compressed frame of the video. You assign specific timestamps for when you want it to be decoded and when you want it to be presented, and then you put it into the media container. You do the same with the next image: the same way, you get a new packet and send it to your file. The only constraint is that you must encode all the frames one by one, in sequence, simply because codecs use specific math that depends on the order in which the images are encoded. And here is more about this YUV image, which is actually a legacy from past times — and a very interesting story I would like to talk about a little bit. So basically, YUV images, instead of RGB color channels, are just a different representation of colors, where you have the luma, or brightness, channel, which actually defines the black-and-white picture, and two additional chroma color planes, which as a result give you the perfectly colored image. And that's actually a legacy from analog television. When television companies faced the problem that they needed to support both black-and-white and color television, instead of making a complete breaking change or laying three new cables, they managed to create an algorithm that allows you to use the same first cable plus two additional ones to create the colored image — or still fall back to black-and-white if you need to. Yeah, that's a kind of neat engineering solution. But as a result, right now you need to do something like this for each frame of your video: for pretty much any kind of image, when you render it, you need to loop over all the pixels of the image — for a 4K video that's a loop over 8 million pixels — and convert them using specific math (see the sketch after this paragraph). To be honest, YUV takes less space on a hard drive, because it can contain less information in the chroma planes. Once you know this — once you know how to convert an image from RGB to YUV — you can send it to the encoder and get a video: you know that you can create a video. And here is the problem: you need to render images. And that's an interesting question, how to render images. That's a problem because today, surprisingly, the browser is still the most popular way to render any kind of static content. There are a lot of developers, a lot of front-end engineers, producing a lot of content, and they're using the browser literally for everything, even for video rendering. And that's becoming really ridiculous, because if you try to find the similarity — what these two images have in common — you will likely fail, I suppose.
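As a side note to the YUV discussion above, here is a minimal sketch of that per-pixel conversion, using the full-range BT.601 RGB-to-YCbCr formulas (one common variant; real encoders also apply chroma subsampling, which is skipped here):

```rust
/// Convert one RGB pixel to full-range BT.601 Y'CbCr.
fn rgb_to_yuv(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    let (r, g, b) = (r as f32, g as f32, b as f32);
    let y = 0.299 * r + 0.587 * g + 0.114 * b;
    let u = 128.0 - 0.168_736 * r - 0.331_264 * g + 0.5 * b;
    let v = 128.0 + 0.5 * r - 0.418_688 * g - 0.081_312 * b;
    (y as u8, u as u8, v as u8)
}

/// The loop over every pixel the talk describes: ~8 million per 4K frame.
fn convert_frame(rgb: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut yuv = Vec::with_capacity(width * height * 3);
    for px in rgb.chunks_exact(3).take(width * height) {
        let (y, u, v) = rgb_to_yuv(px[0], px[1], px[2]);
        yuv.extend_from_slice(&[y, u, v]);
    }
    yuv
}

fn main() {
    // A 2x1 "image": one red pixel, one white pixel.
    let rgb = [255, 0, 0, 255, 255, 255];
    println!("{:?}", convert_frame(&rgb, 2, 1));
}
```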
But in fact, on the left side you can see a streamer with some fancy background, with notifications appearing on the screen with animations and progress bars, and on the right side you can see something completely different: the GitHub preview images for social media — the Open Graph images. It turns out that both of these graphics are rendered in a browser. OBS, the Open Broadcaster Software, is using a browser to render all the animatable content for notifications and everything, and all the plugins are using it. And GitHub is also using a headless browser to make those Open Graph previews, which is really far away from being efficient. But I've been trying to find a way to make frame and image rendering more efficient. And if we think this through, it appears that we need a format that is fixed-size, because an image has fixed dimensions; it must be easy to animate; it must be DX-friendly and easy to understand; it must support the GPU for fast and efficient rendering; and it must have a specification, so it can be rendered and understood by a user. And the most important part: it must be debuggable — nobody wants to dig through the guts of your source code and debug it to understand a problem; it must be easy to figure out, to write, and to render. And it appears to be really hard to find something that covers all of these criteria. But surprisingly, there has been, for a long time, a format that fits perfectly, meets all of these criteria, and is much better than pretty much anything else for rendering any kind of static content. It's SVG — Scalable Vector Graphics — and it's used everywhere, especially on the web. It may be confusing, it may look horrific — those path puzzles have always horrified developers — but in fact SVG is a pretty self-contained format, and it allows you to render literally everything. For example, here you can see that we are rendering the SVG using a Rust macro. This is the public API of my framework; we are coming closer and closer to the actual demo. You define the SVG: you define a rectangle that will be the background, because it has self.width and self.height. Then you define a simple animation: here you can see that from zero to the fifth second it transitions from white to some other color, and then, step by step, to other colors. Then you have a text with a specific font family at specific coordinates, and another text. And — ta-da — it renders something like this. This is the editor of fframes. And yeah, welcome to my framework; this is mostly its first public demo, but that's how it works. It renders the SVG, it renders the timeline, and, as a bonus, it actually renders an SVG, so it's a super debuggable format, super easy to preview, to construct, and to understand. And what is even more important, it is specified and pretty widely popular: in Figma, you can literally construct your frame, then right-click on the frame, copy it as SVG, paste it directly into your Rust macro, then use any kind of Rust expression inside the frame definition, and you will get the preview of the video. That is the really important part of making videos.
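The real fframes macro isn't reproduced here, but the underlying idea — an SVG string produced per frame as a function of time — can be sketched in plain Rust. The white-to-color 5-second transition mirrors the demo above; names and values are illustrative, not the actual fframes API:

```rust
/// Hypothetical stand-in for the fframes macro: build the SVG for time `t` (seconds).
fn frame_svg(t: f64, width: u32, height: u32) -> String {
    // Linearly fade the background from white to a teal-ish color over 5 seconds.
    let k = (t / 5.0).clamp(0.0, 1.0);
    let r = (255.0 * (1.0 - k)) as u8;
    let g = (255.0 * (1.0 - k) + 128.0 * k) as u8;
    let b = (255.0 * (1.0 - k) + 200.0 * k) as u8;
    format!(
        r#"<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">
  <rect width="{width}" height="{height}" fill="rgb({r},{g},{b})"/>
  <text x="80" y="120" font-family="monospace" font-size="48">hello world</text>
</svg>"#
    )
}

fn main() {
    // One SVG per frame: at 60 FPS, frame n corresponds to t = n / 60 seconds.
    let svg = frame_svg(42.0 / 60.0, 1920, 1080);
    println!("{svg}");
}
```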
You must make the process of making videos really fun and really smooth, and with Rust it's really possible. Because how are we dealing with it? We have the video definition — you've seen this — with an SVG definition that is internally transformed into an AST and sent to the WASM bridge, which sends the correct frame to the editor app. The frame definition can use some APIs from the core of fframes (by the way, that's the name of my framework), like animations, like subtitles, and so on and so forth. And we have the editor app, which gets the SVG and shows it at the correct time. It's pretty simple. And we also have the renderer lib, which takes the same video definition, creates the images from the SVG, sends them to the encoder, and then gives you the real video file. And creating the WASM bridge in Rust is really simple: you just define a macro and you have a completely working WASM-based editor that plays the video at 60 FPS with ease. But now you also need to render your SVGs, which may not seem an easy task at first glance, because SVG, despite being pretty popular, is still a web-based specification and it's really hard to render. But thanks to perfect Rust and the awesome Rust community, when I just started the project there already was a library for SVG rendering, resvg. It was created by RazrFalcon, and it has about 1,500 tests covering pretty much all the use cases of the SVG specification — it renders SVG even more precisely than the Chrome browser. That's impressive, actually. At first, I just completely depended on this library to render images. Right now I have my own fork of the library to make it more efficient for sequential rendering: to not re-render parts of the SVG when they haven't changed, and to handle reuse between frames more efficiently. But in fact, you can still render SVG in Rust directly without any kind of problems. The problem is that it renders on the CPU, and the CPU is pretty much a bad idea for rendering pretty much anything. On the flip side, when we're talking about an automation tool — programming that actually automates video rendering, where you give some input and get a video file as output — it turned out to be the best case, because nobody will actually build GPU infrastructure, which is pretty expensive, just for one feature. If you have professional software, that's another story. But even so, it's still pretty efficient, if we focus on the rendering and do a clever simplification of the SVG ahead of time — because you don't need to support all the effects of SVG, you just need to have the paths with a specific color scheme. And for the video that I showed you a couple of slides ago — the hello world video, you probably remember it — it did a pretty great job. You can see it's not sped up: it's the real-world performance of rendering 9,900 full-HD frames of video completely on the CPU. Thanks to rayon and Rust's parallelization and compile-time optimizations, you avoid all the copying, and your rendering is parallelized across all the cores, like I've shown here. So basically, the idea of CPU-based rendering is that each of your CPU cores renders a specific file, because, as you remember, a video cannot be filled with unordered frames.
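The chunked, per-core pipeline described here can be sketched with rayon roughly like this (render_frame is a hypothetical stand-in for the resvg-based rasterization; encoding and file concatenation are elided):

```rust
use rayon::prelude::*;

/// Stand-in for the real rasterizer (an SVG-to-pixels call): frame `n` -> bytes.
fn render_frame(n: usize) -> Vec<u8> {
    vec![(n % 256) as u8; 16] // tiny pretend frame; real code returns w*h*4 bytes
}

fn main() {
    let total_frames = 600;
    let chunk_size = 100;
    let frame_ids: Vec<usize> = (0..total_frames).collect();

    // Each chunk keeps its frames in order, so every worker can render and
    // encode its own small, ordered video file; the chunk files are then
    // concatenated into the final video at the end.
    let chunks: Vec<Vec<Vec<u8>>> = frame_ids
        .par_chunks(chunk_size)
        .map(|chunk| chunk.iter().map(|&n| render_frame(n)).collect())
        .collect();

    println!("rendered {} ordered chunks in parallel", chunks.len());
}
```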
And then we prepare the video, in that case, to be easily concatenable, and just concatenate the chunks at the end. And then we have a pretty performant way of rendering pretty much anything — except the things that break performance really hard: shadows, gradients, and blur are the killers of the CPU. That's just the nature of the CPU: you need to process each pixel and calculate each position one by one. Things that require several passes, or smoothing a giant number of pixels to calculate a specific precision for each pixel, kill the performance completely. And here is the most amazing part of this project: a GPU renderer created completely from scratch. It's still not 100% working, it's still not 100% compatible with the SVG stack, but it's amazingly fast. Our CPU renderer is still very fast, though. If we compare the FPS — the frames per second rendered without encoding — on CPU and GPU, you will notice that the hello world video, due to the fact that it's pretty simple (it renders only text and a color, but still looks pretty nice), renders really fast on the CPU because of parallelization, while the GPU is less parallelizable across different frames; when you parallelize over the images, the CPU becomes slightly faster. But when you increase the resolution, you get pretty much the same results. And for the blurs and gradients — when you have some full-page gradient that is then blurred, which requires six passes through all the pixels of pretty much the whole background — the GPU becomes much faster. How it works is a pretty interesting question: thanks to the fact that we simplify the SVG down to paths only, we can do tessellation. That is the algorithm of parsing the paths — the dots and the vectors — into vertices and indices, giving them directly to the GPU, and getting the rendering result right from the GPU, which is much faster for effects and shading. And we have not so much time left — I don't know how I ended up in a situation where I don't have enough time. But we also need to have audio inside the video. And because we can work pretty flexibly with images, we can do exactly the same with audio. We can even generate audio with math: because we know that audio has a sinusoidal, frequency-based nature, we can generate some sound right from the code (see the sketch after this paragraph). But in fact, what a real user needs is something like this: nothing more than just remixing audio. It's shown here as a preview image, but in fact you get pretty much the same result in an audio file. You load the audio using the decoder and then mix it. But what is really awesome is that you can define — yeah, here is how you can define the audio map with fframes; then it will mix all the audio into one and put it into the file. And what is more interesting: given that we already have the audio data in memory, we can provide an API for users to create audio visualizations. And that was actually how the project started. I had a podcast at the time, and I always disliked podcasts, because you always have a problem: you don't understand who is talking right now.
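A minimal sketch of both audio ideas — generating a tone from the sine math and mixing two tracks by summing samples — assuming 48,000 Hz f32 samples (function names are illustrative, not the fframes API):

```rust
use std::f32::consts::TAU;

const SAMPLE_RATE: u32 = 48_000; // 48,000 samples per second, as in the talk

/// Generate `secs` seconds of a pure sine tone at `freq` Hz.
fn sine_wave(freq: f32, secs: f32) -> Vec<f32> {
    let n = (secs * SAMPLE_RATE as f32) as usize;
    (0..n)
        .map(|i| {
            let t = i as f32 / SAMPLE_RATE as f32;
            (TAU * freq * t).sin()
        })
        .collect()
}

/// Mixing is just summing samples (halved here for headroom, to avoid clipping).
fn mix(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter()
        .zip(b)
        .map(|(x, y)| ((x + y) * 0.5).clamp(-1.0, 1.0))
        .collect()
}

fn main() {
    let a4 = sine_wave(440.0, 1.0);   // 1 second of A4
    let e5 = sine_wave(659.25, 1.0);  // 1 second of E5
    let chord = mix(&a4, &e5);
    println!("{} mixed samples", chord.len());
}
```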
But with fframes, and with really not so much code — with just an API that gets the vector of frequencies and then renders it using rectangles — you can get a pretty awesome-looking visualization with pretty much any kind of design, and render a 1-hour video within 15 minutes with fframes, without any additional GPU usage, only on the CPU, which is pretty impressive. And I think we don't have a lot of time left, but I can say that what I learned over one and a half years of developing this video creation framework is that videos are really interesting. And if you think the same — this is the animation of fframes — I'm really glad to invite you to try out fframes, because it just came out in beta testing. Yeah, right now. Yeah, I know, that's amazing. Starting from today, you can sign up for our Discord and just put your GitHub name in the beta channel. You can sign up either from this QR code or by going to fframes.studio, and try out the internals of this project, play with CPU and GPU rendering, create pretty much any kind of video, and automate it whenever you want to. Thank you for watching this talk.
...

Dmitriy Kovalenko

Lead Software Engineer @ Lightsource



