Conf42 Machine Learning 2023 - Online

Generative AI will enable crowdsourced interactivity

Abstract

After text, images, and video, no new content format has gone mainstream.

With Generative AI at the forefront, tools are emerging to revolutionize interactive content creation. Imagine a future where stories come to life through interactive interfaces.

Summary

  • By the time this talk is over, almost 10,000 hours of content will have been uploaded to YouTube and almost 6 million stories posted on Instagram. The talk combines three threads: interactive interfaces between the physical and digital worlds, the landscape of generative AI, and crowdsourcing, starting from how human-to-human communication works.
  • The next human-computer interface could be one that combines verbal and nonverbal signals to enable a natural modality of communication between human and computer, and it acts as a foundation for collaboration or cooperation between people and AI systems.
  • Also under research are other types of interfaces: brain-computer interfaces and haptic interfaces. With brain-computer interfaces, people can interact with a machine directly from their thinking, from their brain, without any verbal or nonverbal gesture. Beyond that lies the "no interface": a connected world.
  • The generative adversarial network (GAN) was developed by Ian Goodfellow and his colleagues in 2014, with the major purpose of generating realistic images. Generative AI can be broadly divided into two families, GANs and language models; the talk goes into both in depth.
  • In large language models, the objective function is based on maximizing the likelihood of the next word in the sequence. Creativity is an iterative process, and GANs work the same way; we are heading in a direction where GANs can augment human creativity.
  • If GANs and language models are combined, they have enormous potential to generate multimodal interfaces. This will in turn lead to interactive content and change the way human beings interact with content. Researchers are also working on improved training techniques.
  • Crowdsourced interactivity through generative AI shows up in several areas: idea generation, experimentation, and refinement are all important ones.
  • Generative AI has the potential to democratize the creation of text, video, images, and audio, all put together into a multimodal immersive experience. The future of content will be adaptive and personalized based on how people interact with it.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
By the time my talk is over, almost 10,000 hours of content will have been uploaded to YouTube, almost 6 million stories posted on Instagram, almost 700 million messages shared across different messaging platforms worldwide, and almost 20 million swipes made on dating platforms. So if you are a content creator, a brand, a marketer, or an employer, you have to stand out in front of your audience so that they engage with your content. There is no way around that. Hi, I'm Gaurab Patra, co-founder and CTO of Flurgo. Welcome to my talk. The title of the talk combines three different fields: first, interactive interfaces, the interactivity between the physical and digital worlds; secondly, the landscape of generative AI; and finally, crowdsourcing.

How does human-to-human interaction or communication work? Natural communication among humans consists primarily of a mix of speech, which is verbal, and nonverbal signals such as facial expressions, hand gestures, eye motion, smiles, touch, and body language. These are multimodal, complex, and contextual communication methods, and they complement each other far more deeply than any single modality or signal of communication.

Next, how does a human communicate or interact with a machine, today or in the future? The next human-computer interface could be one that combines verbal and nonverbal signals to enable a natural modality of communication between human and computer, and between computer and computer as well. And it acts as a foundation for collaboration or cooperation between people and AI systems. It starts with graphical interfaces, which can be viewed, clicked, swiped, touched, and pinched; those already exist today. Then there is voice recognition technology, which is improving day by day. It understands natural human language much more accurately and processes and acts on it accordingly, eliminating the need for input devices such as keyboards. That also exists today, in forms like Alexa, which currently takes your instructions and acts accordingly, but has a lot of potential in the future to make decisions or take action on your behalf without your instructions. Now there is gesture and motion control. With augmented reality, people are playing with and interacting with digital objects through their gestures in a much more immersive and interactive way, and that includes manipulating digital objects in the 3D world. If you haven't heard of Project Soli from Google, you should check it out. It is a multipurpose, sensor-based miniature radar that tracks human motion and hand gestures and takes action accordingly; it is an interpretation of natural hand-gesture language. You can do virtual dialing instead of tapping a screen, for instance. It also mimics your actions, doing on a virtual plane what you would do on a real one. It does not depend on ambient light or any other external conditions, and the sensor comes in a very compact package that can even be embedded into other devices, so it comes as a full suite. Also under research are other types of interfaces: brain-computer interfaces and haptic interfaces.
With the advancement of neuroscience, brain-computer interfaces let people interact with a machine directly from their thinking, directly from their brain, without any verbal or nonverbal gesture. And haptic interfaces are very interesting: the claim is that you can actually get physical sensations from a virtual world. It is a communication between two sets of sensors with intelligent machines in between, so if you perform an action on one side, the person on the other side can actually feel it. It is a very interesting area that people are researching, with a lot of sensor work involved. There is another kind of interface, which I like to call the "no interface." It is about a connected world: sensors, machine learning algorithms, and digital devices all put together, which can make decisions on your behalf without you having to instruct anything. It analyzes your environment and augments your decision-making, essentially working as an additional brain and additional hands for what a human being can do when interacting with machines.

Now, let's go a bit deeper into generative AI, the elephant in the room. For that, we should understand a bit of the history. In the 1950s and 60s, researchers started developing simple AI systems to perform small tasks, but these were very primitive. Things started maturing in the 1980s, when researchers created much more sophisticated AI algorithms, but the field really took shape only in the last decade, with the advent of the GAN, the generative adversarial network, developed by Ian Goodfellow and his colleagues in 2014. Its major purpose was to generate realistic images. It has two models combined together: one generates fake images, and the other attempts to distinguish a fake image from a real image. Fundamentally, a GAN generates lookalikes of the data it is fed. That is the fundamental process of a GAN; I'll go deeper into the architecture in a bit. However, the GAN is not the only kind of generative model. Generative AI can be broadly divided into two families: one consisting of GANs, used for image synthesis, music synthesis, and that kind of generative work, and the other consisting of language models, which are mostly used for generating text. These are fundamentally two different kinds of networks; I'll go into much more depth on them.

If you are from an ML background, you know the significance of an objective function, so I won't go into detail. In GANs, the objective function is based on adversarial training between two networks: a generator network and a discriminator network. The generator always tries to minimize the discriminator's ability to distinguish between fake and real images (let's consider images, though it works for other types of input too), whereas the discriminator always tries to maximize its ability to distinguish between fake and real images. In language models, by contrast, the objective function is based on maximizing the likelihood of the next word in the sequence, given the previous words. That is how a language model, a large language model, works. Now, let me draw an analogy to how the creative process works for an actual human being; if you think about it, fundamentally everyone is creative, right?
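
To pin down those two objectives concretely (standard formulations, not taken from the talk's slides): a GAN plays a minimax game between a generator G and a discriminator D, while a language model with parameters theta maximizes next-token likelihood.

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

    \max_{\theta} \; \sum_{t=1}^{T} \log p_{\theta}\big(w_t \mid w_1, \dots, w_{t-1}\big)

In the first expression the discriminator pushes the value up by classifying correctly while the generator pushes it down by producing convincing fakes; in the second, each word w_t is predicted from all the words before it.
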
So how do we exercise creativity? We imagine something and visualize it in front of us; there are multiple tools available for that. We iterate on it, change something, reimagine it, and visualize it again, and then reimagine again, and so on. That is how creativity is an iterative process. If you think about how a GAN fundamentally works, it does the same thing: it is an iterative process that makes the engine more and more creative. So we are going in a direction where GANs can actually augment human creativity, and researchers are moving that way.

Now, a little bit about large language models, or language models in general. These are typically neural networks called transformers; most LLMs are based on transformers, which process a sequence of data, in this instance text. What the model does is create a probability distribution over the next tokens given the input tokens so far, and from that probability distribution it predicts the correct next token in the sequence, given the previous tokens. Fundamentally, LLMs use a technique called self-attention, which allows the neural network to attend to different parts of the input sequence, that is, to all the input tokens with different weightings, while making the next prediction. LLMs are generally trained on large amounts of data and sometimes fine-tuned for specific use cases, be it legal work, copywriting, translation, or question answering. Once trained properly, LLMs can generate the next stretch of text by sampling from the learned probability distribution over the next tokens.

So we have discussed GANs and we have discussed language models. What I fundamentally believe is that if these two are combined, they have enormous potential to generate truly multimodal interfaces, which will in turn lead to interactive content and change the way human beings interact with content. Today, content is mostly static; I consider video static content as well, because there has not been any major revolution in the content creation space, in the content format space, after video. People have started creating interactive videos and that kind of thing a bit, but there has been no real revolution, because the generation process, from ideation and creative exploration through implementation, as well as how we consume the result, is very complicated. With generative AI, multimodal content comes with ease. We fundamentally believe at Flurgo, and we have been researching this, that generative AI is opening up the tools for creating that multimodal interactivity. Having said that, for large language models there is also another school of thought which holds that adding more and more parameters to a model, or creating ever larger models, might not be the only route to emergent creativity in text generation. I fundamentally believe that too, because with concerns like data privacy and the need to run a model locally on your handheld devices, it is very much necessary for us as the research and tech community to innovate on the kinds of language models that serve almost the same purpose. People are thinking in this direction.
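
As an illustration of the self-attention step described above, here is a minimal sketch of scaled dot-product self-attention in plain NumPy; the dimensions and random weight matrices are hypothetical, and a real decoder-style LLM would additionally apply a causal mask so each token attends only to earlier tokens.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # Project each token embedding into query, key, and value vectors.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Every token scores every token; scale by sqrt of the key dimension.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Row-wise softmax turns scores into attention weightings.
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        # Each output token is a weighted mix of all value vectors.
        return weights @ V

    # Toy usage: 4 tokens, 8-dimensional embeddings, random weights.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
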
It is a very exciting and interesting field to watch out for in the next couple of years. Now, into the architecture of a GAN model. As I mentioned earlier, it combines two different networks: one is the generator network, the other the discriminator network. The generator takes random noise as input and generates data that resembles the training data. The discriminator's responsibility is to take the real training data and distinguish it from the generated data. During the training process, the generator and the discriminator are trained in an adversarial manner: the generator aims to fool the discriminator, and the discriminator aims to correctly distinguish between real and generated images. Over time, the generator learns to generate increasingly realistic data that can fool the discriminator into judging fake data as real. Those are the fundamentals. And once trained properly, the generator can be used to generate new data similar to the training dataset.

Fundamentally, language models are getting more and more competent at understanding and processing human language, so they are finding application across industries, not only in generating new text but also in prediction and suggestion for any kind of textual analysis. A GAN, fundamentally, generates output that is a lookalike of, or similar to, its training dataset; that is how the entire model is structured and architected. So we can say the model is effectively working from the crowdsourced data it is presented with. Understanding how generative models work and why they produce the output they do is a very important research direction right now. This includes analyzing the models' internal representations, how they learn, and how their generated output can be explained or interpreted. That is the fundamental idea: generating something that is a lookalike of, or influenced by, the crowdsourced creativity fed into the generative adversarial network.

Generative AI is at a very nascent stage, a fundamental building-block stage, and apart from explainability there are multiple other things worth watching from a model-understanding perspective. People are working toward improved training techniques. One area of active research is developing better techniques for training generative models, such as more stable and efficient optimization methods, or avoiding mode collapse in GANs. Multimodal generation is also being looked at very closely, and different techniques are being developed for it: how we can generate immersive content that is a combination of text, video, images, and audio all put together. People are also researching controlled generation, meaning generative models that can be steered toward a specific direction or a specific type of output, characterized by some defined criteria. And people are researching how to incorporate prior knowledge into generative models.
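
To make that adversarial loop concrete, here is a minimal, hypothetical PyTorch sketch on toy 2-D data; the tiny MLPs, the synthetic "real" distribution, and the hyperparameters are all illustrative assumptions, not Flurgo's actual models.

    import torch
    import torch.nn as nn

    # Toy generator: noise in, 2-D samples out. Toy discriminator: real/fake score.
    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    for step in range(1000):
        real = torch.randn(64, 2) * 0.5 + 2.0   # stand-in "training data"
        fake = G(torch.randn(64, 8))            # generator maps noise to samples

        # Discriminator step: maximize ability to tell real from generated.
        d_loss = bce(D(real), torch.ones(64, 1)) + \
                 bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator step: try to fool the discriminator into scoring fakes as real.
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()

The detach() keeps the discriminator update from sending gradients into the generator; the generator then gets its own gradient signal through a fresh pass of the discriminator.
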
This includes developing models that can learn from structured data such as graphs and tables, or that can leverage external knowledge sources such as knowledge graphs.

So where are we thinking about crowdsourcing interactivity? One area is definitely idea generation, where different variations of the iterative creative process are sourced from different people who are creative and good at different aspects of adding interactivity. Someone might be good at generating the interfaces, someone at generating the background or other assets, someone at defining how different assets interact: the logic of interaction between different objects, how people interact with the objects, generating content, and all of that. Experimentation is definitely another area of crowdsourced interactivity through generative AI, because rapid experimentation leads to more and more creative content and builds momentum quickly in a particular direction. Refinement is another very important area. When I have something in front of me, I use AI to refine it: extending the background into an infinite image, making the background much crisper, tweaking the brightness and other aspects of the interface, and also the interactions, how people swipe and how animations run on top of it. Everything needs a lot of refinement, and when I can see the result and have an AI assistant working with me on it, the process becomes much smoother. And one very crucial area is generating multimodal interfaces. As I mentioned, generative AI has the potential to democratize the creation of text, video, images, and audio, all put together into a multimodal immersive experience, in 2D as well as in the 3D plane. Then there is personalization: the future of content is going to be adaptive and personalized based on how people interact with it. For example, if you are creating an employee onboarding process, it has to be interactive and personalized for each employee so that they get value out of it and engage with your particular set of onboarding content.

This is my favorite area, and I'll show how we are experimenting with a particular example; by the end of the next few slides, you will see what we are trying to build here. First, some context. If you are trying to create engaging and interactive content, you need a set of features that exist today in some form or other, put together in an AI-assisted way. The first is image synthesis. An image can be synthesized from an existing real image that gets enhanced with AI, or it can be generated entirely from prompts. Here is an example I created for this particular talk: I chose to create it from scratch with Stable Diffusion through prompts, and I have shown two or three different example prompts. Next, let's say I have to make this landscape infinite, so I can imagine it as a long platformer, a game-like interface. I do that with AI itself. If you look at the infinite landscape, it is similar to the image I chose from the generated ones: the unicorn running on the surface of Mars. Now I integrate a language model like GPT-4 to generate an idea, where I give a prompt, let's say "a gamified startup idea validator."
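
For the prompt-driven image synthesis step, here is a minimal sketch using the Hugging Face diffusers library; the checkpoint name is one public Stable Diffusion release, the prompt is my paraphrase of the talk's Mars-unicorn scene, and a CUDA GPU is assumed.

    # pip install diffusers transformers accelerate torch
    import torch
    from diffusers import StableDiffusionPipeline

    # Load a public Stable Diffusion checkpoint in half precision on the GPU.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Hypothetical prompt in the spirit of the talk's example scene.
    prompt = "a unicorn running on the surface of Mars, side-scrolling game background"
    image = pipe(prompt).images[0]
    image.save("mars_unicorn.png")
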
It gave me a name; I can choose my own name as well. And I am able to launch it for early-stage startup founders as my perfect target group. Language models are so powerful nowadays that they can not only generate the questionnaire and the format of the entire game, but also analyze the responses the target group gives while interacting with the content, offer them suggestions, and finally produce a score based on the clarity of the idea you have as a startup founder, for this particular use case. I'll quickly take you through the next step, where I have taken a screenshot of someone actually playing the game. First it asks you to write down your startup idea. Let's say the founder responds with, "I am planning to create a decentralized investment platform." It then guides me through the next set of questions: what the things are that I, as a startup founder, want to validate and whether I have clarity on them, and based on that it gives me a score. For example: "I have questions for you. How do you handle the fractional ownership part, as it is not legal in India?" The founder says, "Okay, I'll form an LLP and it will hold the shares in the assets." It validates that and says, okay, that is a good approach. "Now, how do you validate your market size and growth potential?" So, as a creator, if you want to create something like a startup idea validator and launch it for your target audience of startup founders, that's about it: it creates an entire interactive interface, and even the UI can be created through generative AI; a lot of tools for that already exist. We are integrating everything together at Flurgo to create the entire thing. And it can be used for employee onboarding, for customer acquisition, and in a lot of other places where you want to stand out as a content creator, a brand, a marketer, or an employer with your content. Back in the game, the founder says the product is currently at a very nascent stage and growing at about 20%. It evaluates that and says it has good potential. Then it asks about my go-to-market, how I am going to create awareness about my product, and so on. So it's all about the idea.

Lastly, we are living in very exciting times. I believe gone are the years when you merely operate your machines or your systems; you will rather cooperate with your machines. The future will be augmented intelligence, going in a direction where AI will not only act as your assistant but can also make decisions on your behalf, enhance your creativity, and augment your creativity with crowdsourced creativity. We at Flurgo are going in a direction where we believe the future of storytelling is going to be immersive and interactive. And in order to facilitate that, in order to bring more and more creators into creating immersive and interactive experiences, not only in the formats that are given and restricted by today's social media platforms, you need AI-native innovation in those directions, innovation in actually creating multimodal interfaces for storytelling across different use cases. Thank you guys for joining.
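
As a closing illustration of the idea-validator flow walked through above, here is a minimal sketch using the OpenAI Python client; the system prompt and scoring instruction are my hypothetical reconstruction, not Flurgo's actual prompt.

    # pip install openai   (assumes OPENAI_API_KEY is set in the environment)
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical prompt for the gamified startup-idea validator.
    messages = [
        {"role": "system",
         "content": "You are a gamified startup idea validator. Ask one probing "
                    "question at a time about legality, market size, and "
                    "go-to-market, then score the idea's clarity out of 100."},
        {"role": "user",
         "content": "I am planning to create a decentralized investment platform."},
    ]
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    print(reply.choices[0].message.content)
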
...

Gaurab Patra

Co-Founder @ Flurgo
