Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi.
Hello everyone.
My name is John Komar.
I'm a former senior engineer at Google, and I currently work at Meta.
I have around 13 years of experience working across different stacks of software development.
I'm mostly a full stack developer, but my recent experience has been mostly working on native Android applications at Meta/Facebook.
So today I'm going to present my white paper, which is about pre-computed surround lighting for reduced latency,
and I hope we gain something useful out of it in this conference.
Let's get started.
So the main things we are going to cover in this particular presentation are: what the system architecture looks like; what the performance advantages of this approach are compared to the conventional approaches; how predictive modeling helps make this system more performant than the conventional ways surround lights are handled; what the user experience advantages of the system are; and what challenges I faced when I was writing this paper as well as developing the system.
First, let me give some context on one of the major challenges in the existing systems of ambient lighting.
When I say ambient lighting, I mean that when you are watching something on your screen, let's say a movie or a game, you can have these colorful surround lights set up in your home.
Some of the newer companies, for example Philips and a bunch of other manufacturers, have released these smart bulbs which can sync with the video content being played on your screen.
The usual conventional way of doing this is that they have a camera pointed towards the screen.
The camera sees what is being played on the screen, and then there's a small box which processes what is happening in the video.
It then sends out commands to the different bulbs or lights set up across the room to change their colors to match the screen.
Now the biggest problem in this entire approach is latency, because everything is computed on the fly: the camera seeing what exactly is going on, then the computation happening locally on the device over what it is seeing, then deciding which colors and which brightness make sense, and then sending out the signals to the bulbs.
This entire process is heavily computationally expensive, not to mention the added latency of passing these signals, usually over ZigBee, to these surround smart lights.
And because they have to do this at runtime, there is a limited number of colors and a limited number of brightness levels these machines can play around with, because the more computationally heavy they make it, the more it adds to the latency.
So this entire system is bound by the challenges of physics: basically how much time it takes to process a certain amount of information, and how much time it takes to send those signals to the smart lights, smart bulbs, or different light strips.
To solve this problem, the approach I came up with was to pre-compute, or basically pre-process, what content will be displayed on the screen via the signal pass-through.
So this approach does not rely on a camera or a physical device looking at what is going on on the screen; it pre-processes what the scene might entail, and a predictive buffer basically predicts what the lights for the scene are going to be.
I'll explain this with an example.
Let's say you're watching a movie and there's a police car chase scene.
We know that in most countries police cars have red and blue lights.
So this system would pre-compute that this particular car chase scene requires blue and red lights flickering through the room.
The system can pre-compute and pre-prepare these encodings and send the signals to the bulbs right before the scene starts on the screen, so the users do not experience any latency, and the signals reach the surround lights at the same time the scene appears on the main screen.
What happens in the conventional systems, as I have personally witnessed myself, is that the bulbs only start reacting to the scene, flickering blue and red to match the police car chase (I'm just giving an example), when the scene is already over and no longer even on the screen.
So this whole experience is really jarring.
The latency is extremely high, even though this entire experience is supposed to be more immersive and more engaging.
In my personal experience, as well as in the survey we conducted with internal test users, the conventional way of doing this has been, I would say, completely disconnected: what you are watching on the screen versus what you are experiencing in the smart lights.
So this approach of pre-computation not only bridges that latency barrier, but also gives you a lot of flexibility to play around with.
In this case we did some benchmark analysis, and we found that while the conventional systems had a latency of around 200 to 300 milliseconds, with our approach we can send out the signals with an overall latency of 60.7 milliseconds, which is roughly a 70 to 80 percent reduction compared to the conventional systems.
And because we have the capability of pre-computing this information about the scenes in the movie, the game you are playing, or the TV content you are watching, this information could even be streamed by online streaming platforms like Netflix, Amazon Prime, or Disney Plus, et cetera.
Because this information is pre-processed, the system has the capability of peeking at the frames which will be visible in the next few seconds.
So with this look-ahead approach, the system can not only pre-compute the light signals it has to send out for the current scene, it can even do a look-ahead for the next scene.
That not only saves the pre-processing time when the scene actually comes, which avoids a jarring experience in case the stream buffers or loads, it also creates a smooth experience when you transition to the next scene: the lights are already pre-computed.
It completely avoids even the possibility of a delay, because we already pre-buffered what the smart lights should show, in terms of color and brightness, two seconds before the scene is even on the screen.
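To make the look-ahead idea concrete, here is a minimal Python sketch of how such a predictive buffer could be structured. The names and numbers here (PredictiveBuffer, the two-second look-ahead, the toy usage at the bottom) are my own illustration of the idea, not the actual implementation described in the paper.

```python
from collections import deque

class PredictiveBuffer:
    """Illustrative sketch: holds pre-computed light states for frames a few
    seconds ahead of playback, so signals are ready before a scene appears."""

    def __init__(self, lookahead_seconds=2.0, frame_rate=60):
        self.lookahead_frames = int(lookahead_seconds * frame_rate)
        self.buffer = deque()  # (frame_index, light_state) pairs

    def fill(self, current_frame, compute_light_state, get_future_frame):
        """Pre-compute light states up to `lookahead_frames` ahead of playback."""
        next_needed = self.buffer[-1][0] + 1 if self.buffer else current_frame
        while next_needed <= current_frame + self.lookahead_frames:
            frame = get_future_frame(next_needed)   # decoded frame from the stream
            if frame is None:                       # stream not buffered that far yet
                break
            self.buffer.append((next_needed, compute_light_state(frame)))
            next_needed += 1

    def pop_due(self, current_frame):
        """Return light states whose frames are now (or about to be) on screen."""
        due = []
        while self.buffer and self.buffer[0][0] <= current_frame:
            due.append(self.buffer.popleft()[1])
        return due

# Toy usage: "frames" are just indices; the light state is a dummy tuple per frame.
buf = PredictiveBuffer(lookahead_seconds=2.0, frame_rate=30)
buf.fill(current_frame=0,
         compute_light_state=lambda f: ("rgb", f % 255),
         get_future_frame=lambda i: i if i < 1000 else None)
print(buf.pop_due(current_frame=5))   # states for frames 0..5, already pre-computed
```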
Also, because of this pre-computation, it can support multiple frame rates.
For example, let's say you are playing games: the frequency is not really around 24 or 60 hertz anymore, it can go up to 120 hertz, and newer monitors are about 240 hertz.
With this approach, the system is not capped by the frame rate of what you are watching; it is basically bound by how many seconds of content ahead you are going to process.
For example, if you're playing a game, the game developers, in their game engines, could pre-encode the signals for the colors the game environment has.
Because the character can move in certain directions on the map, they can pre-compute that if the character moves, let's say, in the direction of snow, we can make the entire room snow white, or if the character moves towards the sea, we can make the entire room blue.
All this pre-computation can be done, and game developers can pre-encode certain information which the system can handle and directly turn into signals, making sure they are ready before the scene actually comes.
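As a rough illustration of how a game engine could ship pre-encoded lighting cues like these, here is a hypothetical sketch; the region names, coordinates, and RGB values are all made up for the example and are not from the paper.

```python
# Hypothetical pre-encoded lighting cues a game could ship with its levels.
# Keys are map regions the player can move toward; values are the room-wide
# light state to pre-send when that transition is predicted.
LIGHT_CUES = {
    "snow_field": {"rgb": (255, 255, 255), "brightness": 0.9},   # snow-white room
    "ocean":      {"rgb": (20, 80, 200),   "brightness": 0.7},   # deep blue room
    "cave":       {"rgb": (40, 25, 10),    "brightness": 0.15},  # dim warm glow
}

def predict_next_region(player_position, heading, regions):
    """Toy prediction: pick the region whose center the player is heading toward."""
    def score(region_center):
        dx = region_center[0] - player_position[0]
        dy = region_center[1] - player_position[1]
        return dx * heading[0] + dy * heading[1]   # alignment with movement direction
    return max(regions, key=lambda name: score(regions[name]))

# Example usage with made-up coordinates: the player at the origin, walking north.
regions = {"snow_field": (0, 100), "ocean": (100, 0), "cave": (-50, -50)}
next_region = predict_next_region((0, 0), (0, 1), regions)
cue = LIGHT_CUES[next_region]   # send this cue to the lights before the transition
print(next_region, cue)
```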
We also developed a system for error correction.
For example, when a scene is being shown, or while the predictive analysis is being done for the next few scenes, the system can pre-check itself that the colors it is sending to the smart lights actually make sense.
I'll explain this with an example.
Let's say a character is looking at a beautiful rainbow in the sky.
Even though the primary color in the scene would be the blue sky, the conventional systems in that case would just show blue lights in the room.
But what the user is actually looking at is the rainbow.
In this particular case, with the machine learning algorithms, the system can determine what the main subject and main focus of that particular scene is; in this case, it's a rainbow.
So with this new system, it can actually send rainbow colors, which could be spread across the entire room and make the scene more immersive compared to existing lights, which would only show, I would say, maybe sky blue.
And as I previously mentioned, because the conventional systems have to rely on real-time processing, on the camera input, and on sending the signals on the fly, they are bound by physical limitations on how much they can compute.
So instead of using the entire color gamut, and instead of using the entire brightness spectrum from, let's say, zero to a hundred percent, they play it really safe.
They only play with the major primary colors.
These systems are really risk averse: they will not tinker with the brightness, and they will not tinker with too many color combinations and risk getting it wrong.
So they usually play within a very safe range of colors.
But our new system, because it is pre-computing and has algorithms and checks and balances to make sure the scene and the light signal it is sending actually match, can play with the entire color spectrum, all the millions of color combinations it can produce from different combinations of RGB lights, and it can also go from zero to a hundred percent brightness.
For example, let's say you're playing a horror game: it can go completely dark to create that immersion that you are playing a horror game or a horror scene.
Or if you are watching, let's say, the latest Superman movie with bright, beautiful colors, it can go to a hundred percent brightness with really immersive bright colors to match what the content creator intended the movie or the content to look like.
How did we achieve a sub-hundred-millisecond response time?
These are the key areas we focused on.
The first one is really efficient frame processing.
If we do not have a good frame processing algorithm in place, or good pre-encoding logic in place, there is no way around it: if you are not able to find the colors a particular scene should be showing, then just processing the raw video content and deciding which colors to show becomes the bottleneck.
The other areas in this system are basically optimizations.
Pre-computing is basically applying this frame processing to a few frames in the future, making sure you have enough buffer so that when those scenes come onto the screen, you already have their light states ready.
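As one possible shape for that frame processing step, here is a minimal sketch that splits a frame into zones and averages each zone into a color that could drive one light. This is an assumption about how such a step might work, using plain Python lists instead of a real video decoder; it is not the paper's actual algorithm.

```python
def zone_colors(frame, zones=4):
    """Split a frame (a list of rows of (r, g, b) pixels) into vertical zones and
    average each zone's color. Each zone could drive one light in the room."""
    height = len(frame)
    width = len(frame[0])
    zone_width = max(1, width // zones)
    colors = []
    for z in range(zones):
        r_sum = g_sum = b_sum = count = 0
        for row in frame:
            for x in range(z * zone_width, min((z + 1) * zone_width, width)):
                r, g, b = row[x]
                r_sum += r; g_sum += g; b_sum += b; count += 1
        colors.append((r_sum // count, g_sum // count, b_sum // count)
                      if count else (0, 0, 0))
    return colors

# Tiny synthetic "frame": left half red, right half blue.
frame = [[(255, 0, 0)] * 8 + [(0, 0, 255)] * 8 for _ in range(8)]
print(zone_colors(frame, zones=4))  # -> two red zones, two blue zones
```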
Then there is node communication, basically the talking to the smart lights, because your system knows how far those smart bulbs are.
Let's say your smart bulbs are spread really far apart in your huge home theater setup, or let's say the system is even being used in a commercial theater.
Because of the physical limitation that the bulbs or the lights are located really far away, you can pre-calibrate for that latency: you know that for that bulb it takes, say, one second for the signal to reach it.
And because we have pre-computation, you can actually send the light signal for the scene which is coming a second later, so that it is completely in sync with the video.
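Here is a hedged sketch of that per-bulb calibration idea: since we know roughly how long a signal takes to reach each bulb, we can schedule each bulb's signal that much earlier. The bulb names and latency numbers below are placeholders, not measured values from the paper.

```python
# Illustrative per-bulb transmission latencies (seconds), e.g. from calibration pings.
BULB_LATENCY = {
    "sofa_left": 0.05,
    "back_wall": 0.30,
    "balcony":   1.00,   # very far away / slow network hop
}

def schedule_sends(scene_start_time, light_state):
    """Return (send_time, bulb, state) tuples so every bulb changes exactly at scene_start_time."""
    sends = [(scene_start_time - latency, bulb, light_state)
             for bulb, latency in BULB_LATENCY.items()]
    return sorted(sends, key=lambda item: item[0])

# Scene appears on screen at t = 120.0 s into playback; all bulbs should flip then.
for send_time, bulb, state in schedule_sends(120.0, {"rgb": (255, 0, 0)}):
    print(f"send to {bulb} at t={send_time:.2f}s -> {state}")
```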
This would have been, and actually is, completely impossible in the current systems, because the current systems rely on the camera: they cannot send the signal to the bulb ahead of time.
And if the bulb is really far away, or even if there is network lag, let's say your router is not super efficient and there is a delay between your set-top box or smart device and those bulbs, the latency is really pronounced and it is a really bad user experience.
And by using dedicated systems for pre-computing this, the better hardware and the better GPUs we have, the further this system can be scaled in terms of depth, basically reducing the time frame processing takes as well as increasing how much pre-computation we can do.
Moving on, aside from these approaches, we can further improve the overall immersion by identifying the scene.
Take the previous example of the rainbow in the sky: classifying that the image the user is looking at right now, the scene, is a rainbow in the sky makes or breaks what the signal is going to be.
Existing systems just rely on whatever the most present color in the scene is, which is going to be blue, but our scene classification system will actually recognize that this is not just a blue sky, it has a rainbow inside it.
That way it can completely differentiate which kind of color signals to send to the smart bulbs.
It can also do color mapping.
As I mentioned, previous systems rely on camera input: if your screen does not have a color-accurate display, or the display itself is not of high quality, or let's say you are watching content on a CRT or some other dated hardware, those systems also fail, because they rely completely on what exactly is on the screen, not on what signals are being sent to the screen.
We can also do motion analysis.
Let's say your monitor has a really high frame rate, 140 hertz.
Because the existing systems rely on a camera, the monitor's frame rate could be far higher than the computational capabilities of those systems.
In that case the scenes could actually be changing faster than what those systems can compute and send to the bulbs, and they have so much lag that it is almost impossible to keep the video or game content in sync with the smart lights.
There is also focus detection, which is similar to scene classification, but it can additionally identify whether there is a main subject or some emotional theme, and it can determine which areas to focus on and emphasize.
With the adoption of the system over time, content creators, movie producers, as well as other people who work in color mixing or video engineering, can eventually employ pattern learning algorithms on how accurate the previous predictions were, and this system can grow more accurate over a period of time.
And this entire encoding could be very similar to, let's say, how sound engineers work today.
Whenever we watch a movie, there are so many sound engineers in the background who have watched that content and specifically chosen that this sound should go to the stereo speakers, this sound should go to the woofer, this sound should go to the Atmos speakers.
Very similarly, just as there is an entire ecosystem built around the sound part of the video we are watching, I believe that eventually, down the line, we can also have this surround part, with different smart lights reacting to the video content.
So the major application of this particular approach is going to be gaming immersion.
The gaming industry, I believe, has already crossed other sports in terms of the profits and money it makes.
Gaming is getting very popular, be it phone games, consoles, or PC.
So having a more immersive gaming experience, where your entire room reacts to the game you are playing, is only going to add to the overall experience of the users, of the gamers.
We can also adopt this approach not only for gamers and cinematic movie watchers, but also for educational content, as well as for reducing visual discomfort.
For people who have reduced or impaired visibility, having lights in the room which react to the content could make the experience somewhat more pleasant and soothing.
And people who are not able to focus entirely on the minute details on the screen could still get the overall experience, the vibe of the movie or the scene, simply because of the surround lights reacting to the content on the screen.
Some optimizations could also be done specific to the content types.
As gaming has progressed, we have moved on from 60 hertz being the benchmark of smooth gaming; now 120 hertz is pretty common, and with the advancements in new GPUs, new monitor hardware, as well as virtual reality, it is only going to keep going up.
Since our logic is not really capped to the frequency of the content, and is based on the pre-encoding and pre-detection of what the scenes are going to be, our system is going to easily scale with the improvements in gaming over a period of time.
Cinemas, I believe, have been very stagnant in terms of the movie-watching experience in the past, I would say, 30 years.
Aside from new IMAX and sound formats, there have not been really radical improvements in the cinema-watching experience in the past few years.
Now with this new system, the directors and the sound engineers, and maybe a new occupation, the surround light engineers, can actually include this new information: that corresponding to this movie scene, the theater lights should react in this way, the backlights should react in this way, and if it is a horror movie, the lights should flicker in this way.
For example, imagine watching a movie which has a thunderstorm scene, and the entire theater's lights flash according to the scene in the movie.
It is going to create a much more immersive experience compared to the movie experiences we are used to right now.
Also, with the rise of short-form content, people's focus duration has gone down significantly.
We are all aware of this; I think the studies say the average focus time for people is now around seven seconds, which is alarming.
But with these surround lights and the surround encoding of content, people might be able to focus more, as it can reduce distractions and people can actually focus on the screen they are supposed to.
It can maybe dim the lights around the room when you are focusing on, let's say, your coding tasks, or when you are trying to read your book.
So it will not be limited to just videos: based on the content which is on your screen, it can pre-process and determine what you are working on right now and put you in a focus mode.
It could also be that you are watching some calming videos or listening to soothing music, and the lights in your room could react accordingly, making the experience much deeper.
Now, the architecture, which I briefly touched upon in the previous slides; we can go over that.
Basically, the system is based on the content input, what the video or audio content actually is.
Then we have the frame analysis engine, which sifts through this content, decodes it, and breaks it down into what the lights should be.
Then there is the predictive buffer, which is an optimization that predicts what the next few frames are going to look like.
Then there is the node controller, which is responsible for sending out the actual signals to the surround lights.
Those could be strips, bulbs, or even virtual devices for the sake of testing, eventually replaced with actual bulbs, the physical lights which are responsible for showing the actual colors.
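To tie those components together, here is a very small end-to-end sketch of how that pipeline could be wired: content input, frame analysis engine, predictive buffer, node controller. Everything in it, the stub analysis, the virtual bulbs, the synthetic frames, is illustrative rather than the actual system.

```python
def frame_analysis(frame):
    """Stub frame analysis engine: average the whole frame into a single RGB light state."""
    pixels = [p for row in frame for p in row]
    n = len(pixels)
    return tuple(sum(px[i] for px in pixels) // n for i in range(3))

class NodeController:
    """Stub node controller: would normally talk ZigBee/Wi-Fi; here it just records sends."""
    def __init__(self, lights):
        self.lights = lights
        self.sent = []
    def send(self, state):
        for light in self.lights:
            self.sent.append((light, state))

# Content input: two tiny synthetic frames (a dark frame, then a bright red one).
frames = [
    [[(10, 10, 10)] * 4 for _ in range(4)],
    [[(250, 20, 20)] * 4 for _ in range(4)],
]

# Predictive buffer: pre-compute every frame's light state before "playback" starts.
predictive_buffer = [frame_analysis(f) for f in frames]

controller = NodeController(lights=["virtual_bulb_1", "virtual_bulb_2"])
for state in predictive_buffer:   # during playback, just replay the pre-computed states
    controller.send(state)
print(controller.sent)
```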
Some of the major challenges I encountered while working on this particular project: the first was latency reduction.
As I previously mentioned, traditional systems rely completely on camera input; basically their systems watch the video and translate it the way humans do, which I believe is a really unoptimized way of doing this, because we have algorithms which can pre-process the video format, so we need not actually rely on what is on the screen.
We can pre-compute this information, and we can even predict what the information is going to be in the next few seconds.
Also, content and frame rates change so rapidly in today's environment that some of those existing systems just cannot keep up with the way the technology is advancing: movie scenes are much more immersive, with much more color depth, and games have adopted much higher frame rates.
The hardware that would need to be built to keep up with these technologies would be extremely expensive and not consumer friendly.
And the existing systems really have limitations on how much computation they can actually do while keeping the devices affordable, and on how color accurate and brightness accurate they can actually be with the technical limitations they have.
Now, the implementation considerations I made when I wrote up the system.
Basically, the system could scale from a casual TV or home theater viewer who has some surround lights, all the way to commercial movie theaters, where people go to a movie theater and the system actually runs behind the movie and controls the smart lights around you.
Imagine a 4D movie, except the chairs would not be shaking, but the experience would still be much more immersive.
And the best part about this approach is that it can convert any of the existing theaters into these smart theaters, which respond to the content on the screen, with very minimal additional expenditure.
Converting a regular theater into a 4D theater is going to be significantly expensive compared to this approach, in which you have to spend maybe a few hundred dollars to install these smart lights, and all of the internal logic to pre-compute how the signals will be sent to the bulbs takes care of itself.
And this system could easily live inside a gaming environment.
Let's say you have a PS5, an Xbox, or a computer: if this information is pre-encoded, all of these devices have enough hardware to handle it on the device, live, and they could easily handle sending signals to the smart devices in your home.
So this can scale from the entire movie-watching experience to casual and professional gamers who consume live visual content.
So I believe this approach could be the next step in redefining how we consume media, basically video and audio content, because the performance is going to be extremely fast.
With predictive intelligence it can keep up with new content, new games, new movies, and it can easily keep up with new video formats and hardware, which is not really possible with the existing systems.
And with the advancements in surround lights (better bulbs, smart light bands, light strips, light lamps) the experience is going to keep getting even better.
As the system scales, it can handle more and more lights, maybe even hundreds of lights in the future.
And because all this information is pre-computed, and it is computed once, not everyone relying on the system or using the system has to compute it on their own device.
Movie producers and content producers could pre-compute and provide this information.
So aside from the video, just like subtitles or the surround sound input, this could be a new input, and the users' devices could simply consume this input and send it to the smart lights.
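One way to picture that new input track, alongside subtitles and the surround sound track, is as a simple timestamped list of light cues shipped with the content. The layout below is purely an assumption about what such a track could look like, not an existing format from the paper or any streaming platform.

```python
import json

# Hypothetical pre-computed lighting track shipped with the content,
# analogous to a subtitle file: timestamped cues instead of text.
lighting_track = [
    {"t": 12.0,  "rgb": [200, 30, 30],   "brightness": 0.8},  # red/blue chase begins
    {"t": 12.5,  "rgb": [30, 60, 220],   "brightness": 0.8},
    {"t": 47.25, "rgb": [255, 255, 255], "brightness": 1.0},  # bright daylight scene
]

def cues_due(track, playback_time, send_ahead=0.5):
    """Return cues the player device should transmit now, `send_ahead` seconds early."""
    return [cue for cue in track
            if playback_time <= cue["t"] <= playback_time + send_ahead]

print(json.dumps(cues_due(lighting_track, playback_time=11.8), indent=2))
```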
Overall, this would reduce the dependence on consumers having to buy really expensive new hardware.
For example, the current hardware which does something similar, in a very inefficient manner, costs at least 250 dollars, not to mention the cost of buying additional bulbs and additional cameras, which become outdated really fast.
I hope this new system sounds useful, and hopefully it will eventually be adopted by the industry.
I have written a defensive publication for the same approach, with the same name.
It would be a good read if you want to find out more, and I hope you have fun in the rest of the conference.
Thank you so much.