Conf42 Large Language Models (LLMs) 2024 - Online

Transforming Content Creation with Collaborative AI: A Groundbreaking Approach

Abstract

Discover a new paradigm in content creation: collaborative AI. Learn how leveraging multiple LLMs in an iterative feedback loop led to outputs surpassing the original human-created content—all in mere minutes. Unlock the transformative potential of collaborative AI to achieve content perfection.

Summary

  • Collaborative AI harnesses multiple LLMs in order to achieve a high-quality output. The bar for creating educational content is usually very high when it comes to factuality. With AI, we are now able to get content out much faster, but not without potential pitfalls.
  • Next, we wanted to define the criteria we were looking for in good writing. The second pre-step was to ask each LLM to generate a sample as close to a ten out of ten as possible. Finally, we asked Claude 3 and ChatGPT-4 to identify characteristics of the original sample and score it.
  • Both versions were much better than the first versions: marked improvements from just one round of collaborative AI. The human in the loop agreed with them in terms of the improvement, but at times they were maybe trying to be a little too engaging, a little too fun.
  • Collaborative AI can be a huge addition to whatever AI organizations are currently using. Quality of content and speed are vital: high-quality, engaging content that will really help people learn something.
  • A 9.5 out of ten for one article isn't quite the same as for a corpus, or body, of articles. One way around this is to feed models high-caliber human text for inspiration. To use AI-speak: in a world where the possibilities are limitless, collaborative AI could be a game changer.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, today I'm going to talk about a creative approach I've devised that harnesses multiple LLMs in order to achieve a high-quality output. It's something I call collaborative AI, and my hope today is that you will be able to unlock its potential to really push the upper limits of what is possible in terms of content quality and LLM output. Today I'm going to use ChatGPT-4 and Claude 3 Opus to showcase the power of collaborative AI.

Before I do so, I wanted to talk a little bit about how content creation pre-AI has typically played out in my field, online learning. The bar for creating educational content is usually very, very high when it comes to factuality. We hire what are known as SMEs, or subject matter experts, and their job is to be as accurate as possible, both in writing and in reviewing content. They essentially are the domain experts of their field, be it art history or upper-division calculus. What they are not, oftentimes, is professionally trained writers. As a result, their writing, while grammatically sound and factually accurate, can sometimes come across as a little bit dry, unengaging, and a bit repetitive. This isn't helped by the fact that the amount of content often needed by online providers is staggering, and SMEs have to work under tight deadlines. Engaging writing with memorable examples, smooth transitions, and that writerly touch is oftentimes out of reach for SMEs, even those with professional training in writing.

With AI, we are now able to get content out much faster, but not without potential pitfalls. For one, hallucinations, where the LLM generates inaccuracies and even outright falsehoods. There's this fear that first-time learners might end up thinking the Civil War happened only last century. While such glaring falsehoods aren't necessarily that common, smaller inaccuracies do occur. Then there's the question of writing: do AI models have the ability to take otherwise potentially dry educational content and make it exciting and interesting, while still being accurate and able to convey a sense of authority?

In the presentation that follows, I'm going to take a piece of educational content, have Claude 3 Opus and ChatGPT-4 evaluate it, and then have them generate their own versions. But I'll go further than that, leveraging collaborative AI as I take inputs and outputs from one LLM and feed them into the other, using collaborative AI to improve upon those outputs so that the final product is greater, better than anything either LLM could have generated by itself. So let's dive in and see collaborative AI in action.

So here we are with transforming content creation with collaborative AI. The first thing I did was create a little experiment. The idea here was that we needed to take some baseline, some standard content that the LLMs could improve upon, and we needed to make sure that there was some scoring around that baseline sample. Otherwise it would be difficult to say whether and by how much the other LLM-generated outputs improved. So first off, we needed a piece of educational content, something that could serve as our baseline, and of course we needed to choose a topic as well. And then we needed to define the criteria of quality: what made this a strong or not-so-strong piece of writing, and what did we want the two LLMs, in this case ChatGPT-4 and Claude 3 Opus, to focus on when generating a high-quality sample?
So that again speaks to the idea of establishing a quality baseline. What I did is I had ChatGPT-4 actually write a chapter, and then later Claude 3 scored that chapter. Now I'm going to go through each one of these parts, starting with the piece of educational content and then ending with more details about establishing that quality baseline.

First off, the piece of content: I decided it was going to be a 500-to-600-word article on large language models, and I think that makes sense given the target audience. But I didn't just ask it to write about large language models outright. Instead, I fed Claude 3 three online-learning excerpts from different topics, different areas, something it could model when it actually generated its own article, and I chose samples that were indicative of more average online-learning content. There's a lot of great online-learning content out there, and I definitely don't want to cast aspersions upon the field, but I was going for something a little bit more average, something that someone under a deadline might end up creating. So I fed that to Claude 3 and had it actually characterize the writing, which is a step I like to do with LLMs. It's a reflection step, a sort of in-between step before they actually generate an article. Now, you don't have to do this, but it's something I did before it actually generated the writing. In doing so, it identified eight characteristics from these excerpts, and it also served as a sanity check, just to make sure it could back me up on what I thought wasn't great writing. And indeed it did. It came up with a total of eight; I've only posted six and a half here, but it gives you an idea. The point here isn't to read through each one of these, but that there definitely are lapses in quality.

Once the LLM has that, it can generate this piece of content here, which again is a 500-to-600-word chapter on LLMs. I actually used my editorial eye just a little bit, looked at those eight characteristics courtesy of Claude 3, and changed, tweaked just a few things, but nothing major. And this is what we ended up with. Again, I'm not going to pause here too long; the point isn't to really read this. In fact, I only excerpted it because this is clearly not 500 to 600 words, but the text from which this is excerpted was around the 500-word mark. Usually LLMs aren't that great at counting, but they did a pretty good job here. But what is mediocre about this? Let's just really quickly look at that first sentence, where it says LLMs are a type of AI, artificial intelligence. Then notice the second sentence: "LLMs use." It repeats that exact same noun, and that gives rise to a repetitive, dry kind of writing. If you dive in here a little bit more, you'll see that as well: the third paragraph starts off with LLMs. The idea here is that it's the quality of writing we're going for, and it just doesn't hit the mark.

Next, we wanted to define the criteria we were looking for in good writing. So when we have the LLMs create quality output, what are we defining as quality? We marked some criteria here. The LLM, in this case Claude 3, came up with five categories: engaging language and storytelling, relatable examples, thought-provoking questions, sentence structure, clarity, et cetera. This is what it identified, and what I agreed were hallmarks of strong, engaging educational content.
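As a rough illustration of this baseline pre-step, here is a minimal Python sketch. The `call_llm(model, prompt)` helper is hypothetical, a stand-in for whichever API client you actually use, and the prompts are paraphrases of the steps described above rather than the exact ones from the experiment.

```python
# Hypothetical sketch of the baseline pre-step: characterize average excerpts,
# generate a baseline chapter in that style, then score it against the criteria.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` via whatever API client you use."""
    raise NotImplementedError

def build_baseline(author_model: str, reviewer_model: str, excerpts: list[str]) -> dict:
    joined = "\n\n---\n\n".join(excerpts)
    # Reflection step: have the model characterize the writing it is about to imitate.
    traits = call_llm(
        author_model,
        f"Describe the key writing characteristics of these excerpts:\n\n{joined}",
    )
    # Generate the baseline 500-600 word chapter in that (average) style.
    chapter = call_llm(
        author_model,
        "Write a 500-600 word online-learning chapter introducing large language models, "
        f"in a style matching these characteristics:\n{traits}",
    )
    # Score the baseline so later versions have something concrete to improve on.
    score = call_llm(
        reviewer_model,
        "Score this chapter from 1 to 10 for engaging language, relatable examples, "
        f"thought-provoking questions, sentence variety, and clarity:\n\n{chapter}",
    )
    return {"traits": traits, "chapter": chapter, "score": score}
```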
So again, in establishing the baseline, we got a score of four out of ten. But I didn't want to just stop there. I asked myself, what if we simply asked the LLM, in a one-shot prompt, to come up with a chapter for an online-learning course on LLMs? What would it come up with? It came up with something from that one-shot prompt, and it got a seven out of ten. Now, I'm not going to paste that here, but I'll say it was high-level, generic, typical AI stuff. This was a good baseline for me, because if I use collaborative AI and it turns out I also get a seven, then there doesn't seem to be much point in collaborative AI when you can just do a one-shot prompt that will get you a decent seven out of ten. But let's see what actually happens when we use collaborative AI.

You'll notice it says pre-step one, so we're not quite there, and sorry to be teasing you on this, but we're almost there in the next slide. For now, though, the pre-step prompt was this: I asked Claude 3 and ChatGPT-4 to identify characteristics of the original sample and score it. That's the reflection piece we did earlier on. So this isn't an integral part of collaborative AI, just something nice to do. And then the second pre-step was to ask each to generate a sample as close to a ten out of ten as possible.

This is where the collaborative AI process and machinery starts, with step one. Here, I input version one from each LLM into the other one, asking it to evaluate it on a score from one through ten. So, for example, Claude 3 created a version in that pre-step number two just a second ago, created that version one, and then I fed that version into ChatGPT. But look at that second part there, after the words "other LLM": that's the part asking it to evaluate. So I didn't just input the version, I actually asked it to rate it and score it, much the way a teacher or a professional would do. With that evaluation in hand, we go to the next step, which is to take the version one evaluation from one of the LLMs and put it back into the other LLM. I know this can get a little crisscrossy, but to give you an example: ChatGPT's evaluation, which was of Claude 3's version one, I then put back into Claude 3. There's a second part to step two, which is inputting it into the first LLM for a rewrite. And that brings us to step number three, where I take the rewrite, which we're now calling version two, and input it back into the other LLM for evaluation and scoring.

Stepping back for a moment, we can think of it as giving Claude 3 an opportunity to do a rewrite the way we would in a classroom, where we get feedback from a teacher. Version two is its rewrite based on that evaluation and scoring. And then, at that point, was there a difference between version one and version two in terms of score? Now, you can carry this process on and on. You could have a version three, a version four, a version five. But I think at a certain point there are diminishing returns. So what we're trying to see in this little experiment is whether there was an improvement between version one and version two. Before we get too excited, though, we have a step number four, which I think is very important: check for hallucinations by inputting version two into the other LLM. So essentially, we're using collaborative AI to do hallucination checks.
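To make the loop concrete, below is a minimal Python sketch of one round of collaborative AI as just described; in the experiment it was run in both directions, with each model authoring a draft and reviewing the other's. The `call_llm(model, prompt)` helper is again a hypothetical placeholder, and the prompt wording is a paraphrase of the steps above, not the exact prompts used in the talk.

```python
# Hypothetical sketch of one collaborative AI round between two LLMs.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` via your OpenAI/Anthropic/other client."""
    raise NotImplementedError

CRITERIA = ("engaging language and storytelling, relatable examples, "
            "thought-provoking questions, varied sentence structure, and clarity")

def collaborative_round(author: str, reviewer: str, topic: str) -> dict:
    # Pre-step 2: the author model drafts a version aiming for a 10/10.
    v1 = call_llm(author, f"Write a 500-600 word online-learning chapter on {topic}. "
                          f"Aim for a 10/10 on: {CRITERIA}.")
    # Step 1: the other model evaluates and scores version one (1-10).
    evaluation = call_llm(reviewer, f"Score this chapter 1-10 on {CRITERIA} and explain "
                                    f"what would make it stronger:\n\n{v1}")
    # Step 2: the evaluation goes back to the author model for a rewrite.
    v2 = call_llm(author, "Rewrite the chapter below to address this feedback, aiming "
                          f"for a 10/10.\n\nFeedback:\n{evaluation}\n\nChapter:\n{v1}")
    # Step 3: the reviewer scores version two so it can be compared against version one.
    rescore = call_llm(reviewer, f"Score this revised chapter 1-10 on {CRITERIA}:\n\n{v2}")
    # Step 4: hallucination check on the final rewrite.
    fact_check = call_llm(reviewer, f"List any factual inaccuracies in this chapter:\n\n{v2}")
    return {"v1": v1, "evaluation": evaluation, "v2": v2,
            "rescore": rescore, "fact_check": fact_check}

# Run the round in both directions, swapping which model plays author and reviewer.
```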
I know I threw a lot of text and words at you, but if you pause here for a moment, you can see this collaborative AI structure spread out for each one of the steps. Again, there could be more steps if you wanted to do more than just two versions, but this is the bare-bones, basic little experiment version we are doing here. So maybe you're curious now: what were ChatGPT's and Claude 3's first versions, and how were they scored? I'm happy you asked. Let's dive in. ChatGPT's first version, we can see, gets a seven out of ten based on the criteria of engaging writing, et cetera, which isn't great given that the one-shot prompt also got us a seven out of ten. But at least it's a starting point; hopefully the second version will be better. And how did Claude 3 do? Let's see what teacher ChatGPT-4 has to say. If you look at the bottom of the first paragraph, it says, "I would rate this version an eight out of ten." So it did a little bit better. For now, this is enough to give you an idea of how this works.

So here's the second step: we feed the evaluations from one LLM back into the other for a rewrite. In this case, I actually excerpted the entire thing, and I did that for a reason. I think it's important to see just how detailed these evaluations are. When the LLM is getting its feedback, you can think of it as a prompt. Imagine writing a prompt that is this long, and even longer. That's not necessarily a bad thing, given that LLMs often thrive off this level of specificity, and there is a lot of specificity going on. But does it actually amount to anything in version two? Meaning, will the LLMs write a better version of the chapter? I'm happy you asked, because now we're at the point where we can ask each to come up with a version that gets a perfect ten. So we've definitely raised the bar, and there's a lot of specificity, and that, of course, is what makes collaborative AI so powerful.

How does this rate? This is version two from Claude, and this is version two from ChatGPT. And, drumroll, their scores: a 9.5 out of ten. You can see this is Claude rating ChatGPT. So even though ChatGPT's first attempt was a measly baseline seven, this one got close to a ten, and Claude's version got a solid nine, if you look at the bottom of the first paragraph. In both cases, version two was much better. So we can see the step three scores here: version one, seven for ChatGPT and eight for Claude; version two, 9.5 and nine. So both marked improvements, simply by using just one round of collaborative AI.

Now, you might be asking, well, what about the human in the loop, in this case me? Did I agree with these versions? Yes, I did read them, and I agreed with them in terms of the improvement. They were both much better than the first versions. In fact, based on the criteria we established, I essentially agreed with these scores. The reason I hesitate to say I wholeheartedly agree is that I felt at times they were maybe trying to be a little too engaging, a little bit too fun. But that wasn't necessarily part of the evaluation criteria. So that's something that, as the human in the loop, I can ask for in a version three, just to tweak it so it doesn't sound like it's trying to be someone's pal, trying to be too relatable.
But again, everything besides that was at a much higher level, making it solid educational content that I think would really pull in audiences and make learning so much more fun and enjoyable. Before we go on, though, I want to compare to the baseline scores one last time, just to see where we came from, speaking of educational content being more fun and enjoyable. If you're going from a four out of ten, and again, not all human-created educational content is a four out of ten, but if the average-ish piece is, then going from four to nine or 9.5 is a huge improvement. And collaborative AI, at least with just this one round, is something that doesn't take long at all compared to some of these editorial and content creation processes that involve multiple SMEs, both creators and reviewers, several rounds, and someone oftentimes overseeing the entire process. You can see that this can be costly, and again, if that standard is only a four, then we're also getting a huge, huge bump in quality. Finally, there's that hallucination check. Both pieces passed. I think, though, it's always super important to have a human in the loop, especially for something like educational content, where you do not want to have facts that are incorrect, no matter what.

Now for the broader implications. Quality of content and speed are vital. Collaborative AI can be a huge addition to whatever AI organizations are currently using. And if they're not using any AI, then they can enter AI, or LLMs, at a much higher level than they would with one-shot prompting. Digital marketing, PR, and corporate communications: these are just a few of the areas where high-quality, engaging content can really help people learn something. Take, for instance, healthcare communication. If something is dry, patients aren't likely to remember it. Make it engaging at that 9 or 9.5 level, and suddenly it's something that's a lot easier for them to learn and pay attention to, something that's really important in health. But again, coming back to hallucinations: always keep a human in the loop for many of these different industries.

So now for the closing thoughts. This is an interesting one. It's the idea that a 9.5 out of ten for one article isn't quite the same as for a corpus, or body, of articles. Why? Well, imagine that 9.5 we saw, and I believe that was the one ChatGPT output. Imagine that it was the exact same one over and over again. Now, I just used the word "imagine" twice in a row, almost as a joke. And the reason that's a joke is that earlier in the presentation, it might not have been obvious because I didn't focus on it, but two of the articles used "imagine a world," or imagine something, to start off, and that alone is a little bit telling. It shows that if you had that across, say, 50 articles, and maybe 20 of them had that phraseology, the repetition would quickly become obvious. So correcting for something like this at scale is important to keep in mind early on when you are creating these pieces. And again, having a human in the loop, someone who can really wield the AI, wield something like collaborative AI, will make it less likely that you're going to see very common or similar opening lines, or really similar writing throughout. But that said, a certain sameness is part of AI writing, and there's almost this AI-speak. After all, we have the GPTZeros of the world that can identify AI-generated language or text for a reason.
It's because there is a certain pattern that makes it slightly different from human-generated text. One way around this is to actually feed models high-caliber human text for inspiration. So if you want to make it sound even more like a person, more relatable, so that it's not always saying "imagine a world where" and using some of those other giveaways of AI-speak, then giving it high-caliber human prose that you want it to model is a good way around that. These are just a few ideas for improving the output using collaborative AI, but in general there are lots of different ways we can use collaborative AI: not just coming up with subsequent versions, one after the other, but maybe even leveraging more than two models, having three models, or having a model judge its own writing in a different thread and then comparing that to what the other LLM said, to see how similar those judgments are. There's so much you can do here. And, to use AI-speak, in a world where the possibilities are limitless, collaborative AI could be a game changer.
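As one last hedged sketch, here is how those two extensions might look in code: seeding the rewrite prompt with exemplar human prose, and having each model both score a draft and critique it as if it were its own, so the judgments can be compared. `call_llm` is once more a hypothetical wrapper around your actual model APIs, and the prompt wording is illustrative only.

```python
# Hypothetical sketch: style exemplars plus cross- and self-judging across models.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real OpenAI/Anthropic/other API call."""
    raise NotImplementedError

def rewrite_with_exemplars(model: str, draft: str, exemplars: list[str]) -> str:
    """Rewrite a draft while modeling high-caliber human prose, to soften AI-speak."""
    samples = "\n\n---\n\n".join(exemplars)
    return call_llm(
        model,
        "Rewrite the chapter below. Match the voice and rhythm of these human-written "
        "samples, and avoid stock openings like 'Imagine a world where...'.\n\n"
        f"Samples:\n{samples}\n\nChapter:\n{draft}",
    )

def compare_judgments(models: list[str], draft: str) -> dict[str, dict[str, str]]:
    """Each model scores the draft and also critiques it as if it had written it."""
    results: dict[str, dict[str, str]] = {}
    for judge in models:
        results[judge] = {
            "score": call_llm(judge, f"Score this chapter 1-10 and justify briefly:\n\n{draft}"),
            "self_review": call_llm(judge, f"Critique this chapter as if you had written it:\n\n{draft}"),
        }
    return results
```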
...

Chris Lele

Data Science and Machine Learning Fellow @ ElevateAICoaching.com

Chris Lele's LinkedIn account


