Transcript
This transcript was autogenerated. To make changes, submit a PR.
By the time my talk is over, almost 10,000 hours of content will have been uploaded to YouTube, almost 6 million stories will have been posted on Instagram, almost 700 million messages will have been shared across different messaging platforms around the world, and almost 20 million swipes will have happened on dating platforms. So if you are a content creator, a brand, a marketer, or an employer, you have to stand out in front of your audience so that they engage with your content. There is no way around that. Hi, I'm Gorapatra, founder and CTO of Flurgo. Welcome to my talk.
The title of the talk points to three different fields combined together: first, interactive interfaces, or interactivity between the physical and digital worlds; secondly, the landscape of generative AI; and finally, crowdsourcing. So, how does human-to-human interaction and communication work? Natural communication among humans consists primarily of a mix of speech, which is verbal, and nonverbal signals like facial expressions, hand gestures, eye motion, smiles, touch, and body language. These are multimodal communication methodologies, very complex and contextual, and they provide much deeper, complementary ways of communicating than any single method or signal of communication. Right,
next, think about how a human communicates or interacts with a machine, either today or in the future. The next interface could be one that combines verbal and nonverbal signals to enable natural modalities of communication between human and computer, and between computer and computer as well, acting as a foundation for collaboration and cooperation between people and AI systems. It starts with graphical interfaces, which can be viewed, clicked, swiped, touched, and pinched; those already exist today. Then there is voice recognition technology, which is improving day by day. It understands human natural language much more accurately and takes action and processes input accordingly. It also eliminates the need for input devices like keyboards. That exists today in forms like Alexa, which currently takes your instructions and acts on them, but has a lot of potential in the future to make decisions or take action on your behalf without explicit instructions.
Now there is gesture and motion control. With augmented reality, people are playing with, or interacting with, digital objects using their gestures in a much more immersive and interactive way. That also includes manipulating digital objects in the 3D world. If you haven't heard of Project Soli from Google, you should check it out. It's a multipurpose, sensor-based miniature radar which tracks human motion and gestures and takes action accordingly; it is an interpretation of natural hand gesture language. You can do virtual dialing, for instance, instead of dialing on a screen. It mimics what you are doing on a virtual plane as if it were a real one. It does not depend on ambient light and has no other external dependencies. The sensors are also packaged in a very compact shape, which can even be embedded into other devices.
So it comes as a full suite with an SDK. Two other types of interfaces are under research: brain-computer interfaces and haptic interfaces. With brain-computer interfaces, thanks to the advancement of neuroscience, people can actually interact with a machine directly from their thinking, directly from their brain, without any verbal or nonverbal gestures. And haptic interfaces are very interesting.
The claim is that you can actually get physical sensations from a virtual world. It's communication between two sets of sensors, with intelligent machines in between: if you do some action on one side, the person on the other side can actually feel it. It's a very interesting area that people are researching, with a lot of sensors being involved. There is another set of interfaces, which I like to call "no interface." It's about a connected world: sensors, machine learning algorithms, and digital devices all put together, which can take decisions on your behalf without you actually having to instruct anything. It analyzes your environment and augments your decision making. It basically works as an additional brain and an additional pair of hands, extending what a human being can do and how they can interact with machines. Now, let's go a bit deeper into
generative AI, the elephant in the room. For that, we should understand a bit of the history. In the 1950s and 60s, researchers started developing simple AI systems to perform small tasks, but these were very primitive. The field started maturing in the 1980s, when researchers began creating much more sophisticated AI algorithms. But generative AI really took shape after 2010, in the last decade only, with the advent of the GAN, the generative adversarial network, developed by Ian Goodfellow and his colleagues in 2014.
Its major purpose was to generate realistic images. It has two models combined together: one which generates fake images, and another which attempts to distinguish between a fake image and a real image. So fundamentally, a GAN generates images that look like the set of data it is fed with. That is the fundamental process of a GAN. I'll go deeper into it and discuss the architecture a bit.
However, the GAN is not the only kind of generative model. Generative AI can be broadly distinguished into two different streams: one consisting of GANs, which are used for text synthesis, image synthesis, music synthesis, that kind of generative work; and another, the language models, which are mostly used for generating text. These are fundamentally two different sets of networks. I'll go into them in much more depth.
If you are from an ML background, you must know the significance of an objective function; I'm not going into the details. In GANs, the objective function is based on adversarial training, where there are two networks: one is the generator network and the other is the discriminator network. The generator always tries to minimize the discriminator's ability to distinguish between the fake and the real image (let's consider images, though it can be other types of input as well), whereas the discriminator always tries to maximize its ability to distinguish between the fake and the real images. Right?
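For reference, that tug-of-war is captured by the standard GAN minimax objective from Goodfellow et al. (2014), where G is the generator and D the discriminator:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```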
Whereas in language models, the objective function is based on maximizing the likelihood of the next word in the sequence, given the previous words. That's how a language model, a large language model, works.
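In symbols, training maximizes the log-likelihood of each token given the tokens before it, over the model parameters theta:

```latex
\max_\theta \sum_{t=1}^{T} \log p_\theta\big(x_t \mid x_1, \dots, x_{t-1}\big)
```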
Also, let me draw an analogy to how the creative process works for an actual human being. If you think about it, fundamentally everyone is creative, right? So how do we exercise creativity? We imagine something, we visualize it in front of us (there are multiple tools available to do that), we iterate on it, we change something, we reimagine and visualize again, and so on. Creativity is an iterative process. If you think about how a GAN fundamentally works, it does the same thing: it is an iterative process that makes the engine more and more creative. So we are heading in a direction where GANs can actually augment human creativity; at least, researchers are moving that way.
Now, a little bit about large language models. These are typically neural networks called transformers; most LLMs are based on transformers, which process a sequence of data, in this instance text. What a transformer does is create a probability distribution over the next token given an input set of tokens, and from that probability distribution it predicts the next token in the sequence, given the previous tokens. Fundamentally, LLMs use a technique called self-attention, which allows the neural network to attend to different parts of the input sequence, weighting all the input tokens differently while making the next prediction.
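As a concrete illustration, here is a minimal sketch of scaled dot-product self-attention in plain NumPy; the shapes, names, and toy dimensions are illustrative, not any particular library's API.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value vectors

# Toy usage: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (4, 8)
```

Note that a real causal language model would additionally mask future positions, so each token attends only to the tokens before it.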
LLMs are generally trained on large amounts of data and sometimes fine-tuned for specific use cases, be it legal work, copywriting, translation, or question answering. Once trained properly, LLMs can generate the next stretch of text by sampling from the learned probability distribution over the next tokens.
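That sampling step can be sketched in a few lines; `logits` here stands in for the raw scores a trained model would emit over its vocabulary.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Turn raw model scores into a distribution and draw the next token id."""
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy usage: 5-token vocabulary; lower temperature -> more deterministic output.
next_id = sample_next_token([2.0, 1.0, 0.2, -1.0, 0.5], temperature=0.8)
```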
So we discussed GANs and we discussed language models. What I fundamentally believe is that if these two are combined, they have enormous potential to generate truly multimodal interfaces, which will in turn lead to creating interactive content and change the way human beings interact with content. Nowadays content is mostly static; I consider video static content as well, because there has not been any major revolution in the content creation space, in the content format space, after video. People have started creating interactive videos and that kind of thing a bit, but there has been no revolution, because the generation process, from ideation and creativity through to implementation, as well as how we consume it, is very complicated. With generative AI, multimodal content comes with ease.
So we fundamentally believe at Flurgo, and we have been researching this, that generative AI is opening up the tools for creating that multimodal interactivity. Having said that, for large language models there is also another school of thought, which holds that adding more and more parameters to the model, or creating larger and larger models, might not be the only route to emergent creativity in text generation. I fundamentally believe that too, because with concerns like data privacy and the need to run a model on your local, handheld devices, it is very necessary for us as the research community, as the tech community, to innovate on the types of language models that serve almost the same purpose. People are thinking in this direction, and it's a very exciting and interesting field to watch over the next couple of years. Now, into the architecture of a GAN model.
As I mentioned earlier, a GAN has two different networks combined together: one is the generator network, the other the discriminator network. The generator takes random noise as input and generates data that resembles the training data. The discriminator's responsibility is to take the real training data and distinguish it from the generated data: what is the difference between them? During the training process, the generator and the discriminator are trained in an adversarial manner, where the generator aims to fool the discriminator, and the discriminator aims to correctly distinguish between the real and the generated images. Over time, the generator learns to produce increasingly realistic data that can fool the discriminator into judging it real when it is actually fake. Those are the fundamentals. And once trained properly, the generator can be used to generate new data similar to the training data set.
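Put together, that adversarial loop looks roughly like this minimal PyTorch sketch; the network sizes, data shape (flattened 28x28 images), and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Generator: noise -> fake image; Discriminator: image -> real/fake logit.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                     # real: (batch, 784) flattened images
    z = torch.randn(real.size(0), 64)
    fake = G(z)

    # Discriminator step: maximize its ability to tell real from fake.
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into labeling fakes as real.
    g_loss = bce(D(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```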
Fundamentally, language models are getting more and more competent at understanding human language and processing it, so they are finding applications across industries, not only in generating new text but also in predicting or suggesting in any kind of textual-analysis-related task.
For GANs, fundamentally, the output they generate looks like, or is similar to, the training data set; that's how the entire model is structured and architected. So we can say that the model may actually be working on the crowdsourced data it is presented with. Understanding how generative models work and why they produce the output they do is very important right now as a research direction. This includes analyzing the internal representations of the model, how they learn, and how the model can be explained or interpreted in terms of how it generates output. So that is the fundamental idea: generating something which looks like, or is influenced by, the crowdsourced creativity fed into the generative adversarial network.
Fundamentally, generative AI is at a very nascent stage, a very fundamental building-block stage, and apart from explainability, there are multiple other things worth watching from the model-understanding perspective. People are working towards improved training techniques: one area of active research is developing better methods for training generative models, such as more stable and efficient optimization algorithms, or ways of avoiding mode collapse in GANs. Multimodal generation is also being looked at very closely, with different techniques being developed for it: how can we generate immersive content which is a combination of text, video, images, and audio all put together? People are also researching controlled generation, meaning generative models which can be steered to generate in a specific direction or towards a specific type of output, characterized by some defined criteria. And people are researching how to incorporate prior knowledge into generative models; this includes developing models that can learn from structured data such as graphs and tables, or that can leverage external knowledge sources such as knowledge graphs.
So where are we thinking about crowdsourcing interactivity? One area is definitely idea generation, where different variations of an iterative creative process are sourced from different people who are creative and good at different aspects of adding interactivity: someone might be good at generating the interfaces, someone at generating the backgrounds and assets, someone at defining how different assets interact, the logic of the interaction between objects, how people interact with the objects, generating content, and all of that. Experimentation is definitely another area of crowdsourced interactivity through generative AI, because rapid experimentation leads to more and more creative content and rapidly builds momentum in a particular direction, a particular field.
Refinement is definitely another very important area. When I get something in front of me, I can use AI to refine it: say, extending the background into an infinite image, making the background much crisper, tweaking the brightness and other aspects of the interface, and also the interactions, how people swipe, how people run animations on top of it. Everything needs a lot of refinement, and when I can see the result and have an AI assistant with me to work on it, the process becomes much smoother.
And one very crucial area is generating multimodal interfaces. As I mentioned, generative AI has the potential to democratize the creation of text, video, images, and audio, everything put together, to create a multimodal immersive experience, both in 2D and on the 3D plane.
Then there is personalization. The future of content is going to be adaptive and personalized based on how people interact with it. For example, say you're creating an employee onboarding process. It has to be interactive and personalized for each employee, so that they get value out of it and engage with your onboarding content.
This is my favorite area, so let me walk through how we are experimenting with a particular example. By the end of the next few slides, you will see what we are trying to build here; first, some context. If you are actually trying to create engaging and interactive content, you need a set of features, which exist today in some form or other, put together in an AI-assisted way. The first is image synthesis. An image can be synthesized from an existing real image that gets enhanced with AI, or it can be generated entirely from prompts. Here is an example I created for this particular talk: I chose to create it from scratch with Stable Diffusion through prompts, and I have shown two or three different example prompts.
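As an illustration of that prompt-driven flow, here is a hedged sketch using the open-source Hugging Face diffusers library; the model checkpoint and prompt are my stand-ins, not the exact ones used for the slide.

```python
from diffusers import StableDiffusionPipeline
import torch

# Load a text-to-image Stable Diffusion checkpoint (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Prompt echoing the talk's example scene.
prompt = "a unicorn running on the surface of Mars, cinematic lighting"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("unicorn_mars.png")
```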
Next, let's say I have to make this landscape infinite, so I can imagine it as a long platformer-style game interface. I do that with AI itself, and it stays in context: if you look at the infinite landscape, it is similar to the one image I chose from the generated set, the unicorn walking, or rather running, on the surface of Mars.
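One way to get that kind of "infinite" extension is outpainting: shift the image over, mask the empty strip, and let an inpainting model fill it in context. A rough sketch with diffusers, where the sizes and the continuation prompt are assumptions:

```python
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
import torch

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

tile = Image.open("unicorn_mars.png").convert("RGB").resize((512, 512))

# Keep the right half of the scene on the left; leave a blank strip to fill.
canvas = Image.new("RGB", (512, 512))
canvas.paste(tile.crop((256, 0, 512, 512)), (0, 0))
mask = Image.new("L", (512, 512), 0)
mask.paste(255, (256, 0, 512, 512))        # white = region to generate

extension = pipe(prompt="Mars landscape, seamless continuation",
                 image=canvas, mask_image=mask).images[0]
```

Repeating this step and stitching the results side by side yields an arbitrarily long, context-consistent landscape.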
Now I integrate a language model like GPT-4 to generate an idea. I give it a prompt, let's say "a gamified startup idea validator." It gave me the name "Unique Order" (I could choose my own name as well), and I am able to launch it for early-stage startup founders as my perfect target group. Language models are so powerful nowadays that they can not only generate the questionnaire and the format of the entire game; they can also analyze the responses the target users give while interacting with the content, offer suggestions, and finally give you a score based on the clarity of the idea you have as a startup founder, for this particular use case.
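For a flavor of how such an integration might be wired up, here is a hedged sketch using the OpenAI Python SDK; the system prompt, scoring format, and helper names are assumptions for illustration, not the talk's actual code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instructions for the gamified validator.
SYSTEM = ("You are a gamified startup idea validator. Ask one probing "
          "question at a time, analyze the founder's answer, suggest "
          "improvements, and finally output a clarity score out of 10.")

messages = [{"role": "system", "content": SYSTEM}]

def turn(founder_reply):
    """One round of the game: send the founder's reply, return the next question."""
    messages.append({"role": "user", "content": founder_reply})
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer

# Example first turn, echoing the talk's demo.
print(turn("I am planning to create a decentralized investment platform."))
```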
Let me quickly take you through the next step, where I have taken screenshots of someone actually playing the game. First, it asks you to write down your startup idea. Let's say the founder responds with: "I am planning to create a decentralized investment platform." It then guides me through the next set of questions: the things that I, as a startup founder, would want validated, and whether I have clarity on them; based on that, it gives me a score. "I have questions for you: how do you handle the fractional ownership part, as it is not legal in India?" The founder says: "Okay, I'll form an LLP and it holds the shares in the assets." It validates that and says: "Okay, that is a good approach. Now, how do you validate your market size and the growth potential?" So, as a creator, this is what it looks like if you want to create a startup idea validator and launch it for your target audience, the startup founders. That's about it.
It creates the entire interactive interface; even the UI can be created through generative AI, and a lot of tools already exist for that, like Galileo AI. We are integrating everything together at Flurgo to create the entire experience. It can be utilized for employee onboarding, for customer acquisition, and in a lot of other places where you want to stand out as a content creator, a brand, a marketer, or an employer with respect to your content format. That said, all of this is currently at a very nascent stage.
Back to the game: as a startup founder, say you are growing at 20% a year; it analyzes that and says the idea has good potential. Then it asks about my awareness with respect to the GTM, how I'm going to create awareness about my product, and so on. So it's all about the idea.
Lastly, we're living in very exciting times. I believe gone are the years where you merely operate your machines or your systems; you will rather cooperate with your machines. In the future it will be augmented intelligence: we are heading in a direction where AI will not only act as your assistant, but can also take decisions on your behalf and enhance your creativity, augmenting it with crowdsourced creativity. And we at Flurgo are going in a direction where we believe the future of storytelling is going to be immersive and interactive. In order to facilitate that, in order to bring more and more creators into creating immersive and interactive experiences, not only in the formats given and restricted by today's social media platforms, you need native AI innovating in those directions, actually creating multimodal interfaces for storytelling across different use cases. Thank you, guys, for joining.