Conf42 Python 2024 - Online

Unlock the Power of LLM: Build a HuggingFace Agent

Empower your projects with HuggingFace agents and tools! Harness the strength of transformers in multi-modal tasks. Choose from curated tools or seamlessly integrate your own. This talk equips you to hit the ground running. Elevate your AI game – your readiness starts here!


  • Hugging Face Transformers is a popular state-of-the-art machine learning library for PyTorch, TensorFlow, and JAX. Predefined, curated tools exist. How can we create our custom tool?
  • An agent is a large language model (LLM) that we prompt to perform a specific set of tasks. If we leverage its ability to generate a small piece of code and equip the agent with different tools, we can use its power, as you will see.
  • A tool is something simple: a single function with a name and a description. Each tool, each function, is dedicated to one very simple task. This is a great instrument for producing chained output. We can leverage the concept of prompting versus coding.
  • Let's have a look at the translation and audio generation tools. The agent can translate to and from over 80 languages, and it can voice the text easily. Under the hood, the translation is handled by Meta's No Language Left Behind (NLLB) model.
  • Hugging Face provides several tools based on transformers models. The first is document question answering. The next is text question answering. We can also create a custom tool. Some tools are transformers-agnostic, because they can use different models or just an ordinary Python script.


This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and let's talk about unlocking the power of LLMs and building a Hugging Face agent. So first of all, as you might know, Hugging Face Transformers is a popular state-of-the-art machine learning library for PyTorch, TensorFlow, and JAX. It provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. That was just a quick introduction; you might already be familiar with it.
Let's go to our agenda. Today we're going to talk about what agents are, what tools are, how to set up and initialize the agent, how to use predefined tools such as translation, image captioning, and text to speech, and what other predefined, curated tools exist. And, very interesting, how can we create our custom tool? So let's get started.
First, what are agents? Let's just think about the general term. You might think of an agent as a person you hire to perform different tasks. For example, an agent can assist you in writing publications, calling someone, or publishing posts on social media. So the general idea is that an agent is an assistant that simplifies your life, right? Now let's go back to the Hugging Face idea of an agent. An agent is a large language model, or LLM, and we are prompting it. We are asking it to perform a specific set of tasks, and the agent can be equipped with different tools. We will talk in just a minute about what the different tools are. And why is this possible? A large language model can generate a small piece of text, a small piece of code, in a very good and efficient way. Maybe if we ask it to generate a whole script, it might not be so good at it, but generating three or four lines of code is something it can deal with. If we leverage this ability to generate a small piece of code and equip the agent with different tools, we can use its power, as you will see. And now let's talk about what tools are.
So if you have a toolbox in your garage, you might have something like this and this and this, and hammers. All these tools are for specific tasks; each tool does a specific job. I'm not very good at it, but you might know which tool to use for which task. It's the same idea when we think about tools for LLMs. A tool is something simple that represents a single function with a name and a description, right? Each tool has its own name, and we have a description of the tool that says how we can use it. And each tool, each function, is dedicated to one very simple task.
If we put this together (this picture is from the official Hugging Face agents tutorial), we have the following structure. We have an instruction, which is what we ask or prompt the agent for, and it is turned into the prompt. In this particular example, we ask the agent to read out loud the content of the image. Conceptually, we first want to understand what's in the image and generate text; that is the first step. The second step is to read this text out loud. So this creates a prompt, and our agent, the large language model, understands it. It has a toolbox right here with different tools, and the agent understands that it can use the image captioner to caption the image and the text-to-speech tool to read the text out loud. It generates code, the code is run by a Python interpreter, and the text is voiced. So this is how it works.
But in general, why should we care, and why might it be interesting for us? I would say that this is a great interaction experience: we don't even need to know how to code. We can leverage the concept of prompting versus coding, so we can prompt and get the code out of our prompt. And this is a great instrument for producing chained output.
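The flow described above (tools with a name and a description, a prompt built from them, generated code run by a Python interpreter) can be sketched in plain Python. This is only an illustration under stated assumptions, not the Hugging Face implementation: the `Tool` dataclass, `build_prompt`, `run_generated_code`, the two stand-in tool functions, and the file name `breakfast.jpg` are all hypothetical, and the "generated" code is hard-coded where a real agent would ask the LLM for it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """A tool: one function plus the name and description the LLM is shown."""
    name: str
    description: str
    func: Callable


# Stand-in implementations; real tools would call models here.
def image_captioner(image):
    return f"a caption for {image}"


def text_reader(text):
    return f"<audio of: {text}>"


TOOLBOX: Dict[str, Tool] = {
    tool.name: tool
    for tool in [
        Tool("image_captioner", "Generates an English caption for an image.", image_captioner),
        Tool("text_reader", "Reads an English text out loud.", text_reader),
    ]
}


def build_prompt(task: str) -> str:
    """Combine the tool descriptions and the user's task into one prompt."""
    tool_lines = [f"- {tool.name}: {tool.description}" for tool in TOOLBOX.values()]
    return "Tools:\n" + "\n".join(tool_lines) + f"\nTask: {task}\nCode:"


def run_generated_code(code: str, **variables):
    """Run the code with only the tools and the user's variables in scope."""
    namespace = {name: tool.func for name, tool in TOOLBOX.items()}
    namespace.update(variables)
    exec(code, {"__builtins__": {}}, namespace)
    return namespace


prompt = build_prompt("Read out loud the content of the image")

# Hard-coded stand-in for the few lines of code an LLM might return.
generated_code = (
    "caption = image_captioner(image)\n"
    "audio = text_reader(caption)\n"
)

result = run_generated_code(generated_code, image="breakfast.jpg")
print(prompt)
print(result["audio"])
```

The output of the first tool is stored in a variable and handed to the second one, which is the chaining idea the talk keeps returning to.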
So if you think about it, we can, for example, generate an image, then add some elements to the image, then maybe resize the image or generate an image caption, translate it, and get the voicing of it. Because there are many different tools and we can add our custom tools, it is very flexible, and we will learn how to add our custom tools.
And now let's go a little bit hands-on and learn how to set up and initialize the agent. As a prerequisite, we need a Hugging Face token. And depending on the agent we are going to use (in our case, the OpenAI agent), we need an OpenAI API key, and we need a bit of code. So let's see what it looks like. Let me open the Colab here and run a simple setup. I use the latest version of transformers here, and I will need to pass my Hugging Face token, which you can get for free. You can go to the Hugging Face Hub and create your token there, a read or a write token. For this particular activity, a read token is enough. While it's running, I should also add that I will upload this code to a GitHub repository, so you will have it in Jupyter notebook format and you can use it and play with it. Let me just grab my token and paste it here. I'm logged in, and I will pip install the OpenAI library here and go to the agent initialization. I will use the OpenAI agent, and I need my OpenAI token here; let me grab it, let me paste it. Voila, we have it.
Now let's go back to the presentation and see what the predefined tools are. The first tool we are going to look at is the image captioning tool. It's very simple: we can have pretty much anything as an image, and we can generate a caption for it. As you can see, we just call the command agent.run with a natural language prompt, and we pass an image as a variable here. Let's go and see what it looks like.
So first I will just quickly choose a picture. I will use some food as the picture, and I will generate a description for my picture in English. As you can see here, while the agent is running my code, it generates an explanation. You can see the explanation: "I will use the following tool", and it chooses the image captioner tool. Then it generates code, then it runs the code, and we can see the output here: a plate of food with eggs, bread, and a cup of coffee, which is true.
Now let's go back to the presentation and have a look at the translation and audio generation tools. The agent can translate to and from over 80 languages, and it can voice the text easily. I'm not going to spend much time on this slide; I'm going to show it to you. A little side note about the translation: under the hood, the translation is done by Meta's No Language Left Behind model, NLLB. They claim to cover approximately 200 languages, but when I last checked, not all of the languages were available for translation. Some of them, at least with agents, were generating errors. So I just passed a list of languages that worked for me, and if you want to use a specific language, you might want to check it first. In my list there are approximately 80 languages, which I believe is also good.
Going back to our tools: we can run them together, so we translate the text and read it out loud in one go, or we can run them separately. First let's run them together in one go. Again, you can see an explanation and the code generated by the agent, so you can see what's happening under the hood. Also, if you want to build a chain of inputs and outputs, don't forget to save your outputs as variables so you can hand them as inputs to the next command. And we can see the translated text here. I'm not going to read it out loud, because it is in Spanish.
Yes, but I have a tool that can read it out loud. Let's hear how it sounds. "Un plato de comida con huevos, pan y una taza de café." I'm not very proficient in Spanish, but to my mind it doesn't sound like Spanish. It sounds like an English version of Spanish or something like that. So maybe we can try to do it separately: first the translation and then the voicing. I will take my translated text as a variable here and try to get the audio. Let's try: "Plato de comida con huevos, pan y una taza de café." No, I believe it's not very Spanish, but we will see what we can do here.
I promised you that we would talk about the other predefined, curated set of tools. Hugging Face provides several tools based on transformers models. Let's take a look at them. The first is the document question answering tool. You can have a document in an image format and ask a question about it. Under the hood, the transformers model used for it is Donut. The next is text question answering. You can have a long text in string format and ask a question based on it. The transformers model used for this task is Flan-T5. The next is image question answering. You can pass an image and ask what's in the image, or ask a specific question about the image itself. The transformers model operating under the hood is ViLT. This is just for understanding what's going on. Next, we can have image segmentation: we can output the segmentation mask of the image, for example to detect animals or nature in the image. The model is CLIPSeg. Also, in our example we had text to speech, and we can have the reverse task: we can take speech and turn it into text. The transformers model for this is Whisper. And we can also have zero-shot text classification.
If we don't have many labels for a classification task, we can provide a text and a list of labels and have BART classify the text. And, pretty straightforward, we can have text summarization: we can pass a long text and get its summary in just two or three sentences. This task is also operated by BART.
We can also have tools that are not based on transformers. That's why they are called transformers-agnostic: they can use different models or just an ordinary Python script, as in this text downloader tool. We just provide a URL and download the text. No advanced model is used under the hood, just a Python script that goes there and downloads it. Pretty simple. We can also have different flavors of Stable Diffusion for text-to-image generation or image transformation. Here you can see an example: first we create an image, and then we decide that we want this image changed a little bit, so we ask the agent to add a rock to the image. And we can also have text-to-video generation, which is also a flavor of Stable Diffusion here.
And, as I already said, we can also create a custom tool. If we think about the agent as an octopus with, I believe, legs, yes, we can have an infinite octopus and add legs to it. These legs are our custom tools, so we can extend the power of our agent. We can also push this leg to the Hub so other members can benefit from our interesting tool. Let's see how to do it. As you might remember, our voicing tool was not very powerful. I'm going to use the Google text-to-speech library to generate audio from text, tailored to a specific language. I already installed this library here, I'm importing it, and I have a simple function using it. I pass the text as a variable, I pass the language as a second variable, and I get audio as the output.
Pretty straightforward. And how can we wrap this function? How can we create a tool based on it? We will import the class named Tool from transformers and inherit our tool class from this parent class. Here is how it looks. We create this class and we pass a name here. What's the name of the tool, like image captioner? Here we have "Google voice in multiple languages", which describes how it works. Then we have a description so the agent understands what this tool is for. So this is the name, and this is the description, like "a hammer is for nails". Here it says that this is a tool that can voice a word or a phrase in a given language, what it takes, and what it outputs. Then we should have a description of our input format and a description of our output format, and we should have a call, our function itself, which defines how it operates. And we can already try it. So if I pass a language: "comida, comida". I believe it's "food" in Spanish, and it already sounds like Spanish.
As you might notice, I'm not using "Spanish" as the language name here; I'm using a language code. That's why I will be using the langcodes library to translate from the human, natural name of a language to the language code. But first let's expand our octopus: let's add this tool to our agent. I'm going to reinitialize, restart my agent, and I will need the OpenAI key once again. I will use langcodes to translate... sorry, the target language will be Spanish here. Yes, it translates "Spanish", or some other language name, to a language code compatible with our tool. And I'm going to run this command as a prompt. I'm going to give a prompt to our OpenAI agent and ask it to generate the voicing of the text that we obtained in the previous steps, to voice it in Spanish. "Un plato de comida con huevos, pan y una taza de café." I would say it sounds more Spanish. Right.
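The shape of such a custom tool (a name, a description, input and output descriptions, and a call that wraps one function) can be sketched as follows. This is a hedged illustration, not the notebook's code: to keep it runnable without any installs, `Tool` here is a minimal stand-in for the parent class imported from transformers, `synthesize` is a stand-in for the Google text-to-speech call, `LANGUAGE_CODES` is a small hypothetical table in place of the langcodes lookup, and the tool name `multilingual_voice` is made up for the example.

```python
class Tool:
    """Minimal stand-in for the Tool parent class that the talk imports from transformers."""

    name = ""
    description = ""
    inputs = []
    outputs = []

    def __call__(self, *args, **kwargs):
        raise NotImplementedError("Subclasses implement the tool's single function here.")


# Hypothetical lookup table; the talk uses the langcodes library for this step.
LANGUAGE_CODES = {"english": "en", "spanish": "es", "french": "fr"}


def language_code(language_name):
    """Map a human-readable language name to a short language code."""
    key = language_name.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"No language code known for {language_name!r}")
    return LANGUAGE_CODES[key]


def synthesize(text, lang):
    """Stand-in for the text-to-speech call; returns a record instead of real audio."""
    return {"text": text, "lang": lang}


class MultilingualVoiceTool(Tool):
    # The name and description are what the agent reads to decide when to use the tool.
    name = "multilingual_voice"
    description = (
        "This is a tool that voices a word or a phrase in a given language. "
        "It takes the text and a language code as input and returns the audio."
    )
    inputs = ["text", "text"]
    outputs = ["audio"]

    def __call__(self, text, lang):
        return synthesize(text, lang)


voice_tool = MultilingualVoiceTool()
audio = voice_tool("comida", language_code("Spanish"))
print(audio)
```

Keeping the name-to-code lookup outside the tool mirrors the talk, where the tool itself only accepts a language code and the conversion happens before the prompt is sent to the agent.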
So that was a walkthrough of how to create your custom tool. If you use a little bit of imagination, you can think about expanding and using different tools, even funny ones like fetching an image of a cat from the Internet. It can be pretty much anything that we can code in Python, so our agent can be really powerful with this.
And this is the conclusion and final thoughts, a quick recap of our talk. Agents are still experimental; they are still a little bit brittle for many use cases, especially when it comes to complex prompting with several steps, so the output can in some cases be unexpected. We need to practice writing correct prompts for using the correct tools. But they are promising. As our agents get smarter, we can have a lot of tools. Agents are easily extendable, and they are also easy to start with. As we saw, we can set up an agent in just a few lines of code and already use a great set of predefined tools. They are very varied, and we can build some interesting chains, and this is what can make our agent really smart.
And some further ideas. If you like the idea of agents, you might want to experiment more with prompting, writing advanced prompts, and you can also bring your own model as an agent. Try different OpenAI models or open-source LLMs as an agent; maybe they will get you better results. And if you like the overall concept of smart agents, smart assistants, you can also check out the agents of LangChain or Amazon Bedrock. They also provide capabilities for empowering an LLM to act on your behalf. And, as I already said, you can find my code, my slides, and the links used in this presentation in my GitHub repository, huggingface underscore agents. Let's stay in touch. Thank you.

Darya Petrashka

Data Scientist @ SLB

Darya Petrashka's LinkedIn account
