Conf42 Prompt Engineering 2025 - Online

- premiere 5PM GMT

Beyond semantic retrievers: leveraging LLMs for question-based document ranking

Abstract

We improved complex document retrieval by using an LLM-driven pairwise ranking approach, prompting the model to judge which of two documents better answers a question. This method outperformed traditional retrievers, boosting recall@k in specialized retrieval scenarios.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you for joining my talk. My name is Maximilian Schattauer. I work at Perelyn in Munich as a technical consultant, and today I would like to present a short result of investigations we did at one of our clients with regard to semantic retrievers, or rather what comes beyond them: namely, how we can leverage not only embedding models and keyword-based methods, but also LLMs that we prompt, in order to get more satisfying retrieval results.

Let's first look at the challenges of conventional retrieval setups. What usually happens is that a user of a retrieval system, in order to get a set of documents out of a corpus, forwards a query to a retriever. The retriever then goes through the corpus with the help of some kind of search or embedding model and retrieves the documents that fit this query best. In the case of an embedding model, "fit" means semantic relatedness; with keyword-based models like BM25, it means the overlap of keywords between the query and the documents. In the end the user has a set of retrieved documents of whatever size they like. They can feed it to an LLM, but they can also, and this is the important case for us here, not use an LLM at all and instead present the documents to a user or pass them to some downstream data application.

The challenges with this setup are that the retrieval process always has to take the entire query into account. You cannot focus on certain aspects or add emphasis on some part of it; it is a black box that you cannot really control. There is also no reasoning process involved, so the order of the documents, or which one the setup favors most, cannot be influenced by reasoning. That matters if you have a very specialized domain where semantic or keyword-based models break down because the documents are too similar for them. There you need a reasoning process to find the minute differences between the documents and use those as the basis for reordering.

We at Perelyn were recently confronted with exactly such a case at one of our clients. We saw from the quantitative quality of the retrieval results that we could not make do with this kind of setup alone; we would need an LLM in the retrieval process to make it more fine-grained and to gain more fine-grained control.

So what we did is use the setup I just presented, feeding a query to a retriever equipped with a corpus and an embedding model, only to get preliminary documents. A classic retrieval process gives you a preliminary set of documents, and then an LLM refines that set and its order, producing the retrieved documents you are actually looking for. The nice thing about this is that it offers an additional text input dimension: at this stage you can specify aspects to emphasize, or add metadata to your documents that you want to be considered during the re-ranking. This lets us bring reasoning into play, with either a normal LLM or a reasoning LLM reordering the documents, and lets us inject other aspects, such as metadata rules, that could not be captured by a classic non-LLM retrieval model.
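To make the two-stage setup described above concrete, here is a minimal sketch in Python. The `retrieve` and `rerank_with_llm` callables are hypothetical placeholders standing in for a classic retriever (BM25 or embedding search) and an LLM-based re-ranker; they are not the speaker's actual implementation.

```python
# Minimal sketch of the retrieve-then-rerank setup (assumed interfaces):
# a classic retriever produces a preliminary candidate set, which an LLM
# then re-ranks with extra context such as aspects or metadata rules.

from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],          # e.g. BM25 or embedding search
    rerank_with_llm: Callable[[str, List[str], str], List[str]],
    aspects: str = "",           # extra focus / metadata rules injected at rerank time
    n_preliminary: int = 50,
    k: int = 5,
) -> List[str]:
    # Stage 1: cheap, coarse retrieval over the whole corpus.
    candidates = retrieve(query, n_preliminary)
    # Stage 2: LLM-driven re-ranking that can reason over the query,
    # the candidates, and any additional instructions or metadata.
    reordered = rerank_with_llm(query, candidates, aspects)
    return reordered[:k]
```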
Now looking into the details of how we actually did this: we did not use the LLM as a one-off re-ranker, that is, putting all the documents into one prompt and having it select from them. Instead, we used the LLM as a binary document operator. We put two documents and the query into a prompt and then ask the LLM: this is the domain, look at the following two documents, look at the query, which one do you find more suitable for answering this query? Please give me an answer, either A or B (or whatever binary labels you choose). By doing this for all combinations of documents, we end up with a matrix of LLM verdicts comparing all the documents. You can see, for example, that document two is more suited than document one, and that document three is less suited than document one, and from this you can establish an order of the documents. Especially if you call the LLM multiple times at a non-zero temperature, you get some statistics to ground this ordering on: you count the verdicts and order the documents by that.

We found in our project research that this is more reliable than a one-off invocation, especially if you are dealing with many preliminary documents, where you would otherwise run into a needle-in-the-haystack problem and documents would get lost in the LLM processing. This way you have more control over the re-ranking process. But, and this has to be considered as well, runtime and cost are an issue here. This takes much longer than a one-off LLM comparison, let alone a classic retriever setup alone, and the incurred costs are of course also higher than with just an embedding model or a keyword-based model, because you have to make on the order of the square of the number of documents in LLM calls.

So what is the take-home message from this anecdotal talk about one of our projects? We have seen that conventional retriever setups have difficulties, especially with specialized corpora and with retrieval tasks that have special needs, such as certain aspects or focuses to pay attention to. In such cases we recommend LLMs as re-rankers in a binary comparison setup, as I have just explained, because we have seen that just comparing two documents at a time gives quantitatively better results than one-off LLM ranking setups. Thank you for listening, and I'm looking forward to the other talks.
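The pairwise re-ranking the speaker describes can be sketched as follows. This is an illustrative sketch, not the project's actual code: `ask_llm` is a hypothetical helper that sends a prompt to a chat model and returns its text answer, and the prompt wording is assumed. It compares all document pairs, repeats each comparison at non-zero temperature, counts the verdicts, and sorts by wins; the comment notes where the quadratic number of LLM calls comes from.

```python
# Sketch of the pairwise ("binary operator") re-ranking described in the talk.
# `ask_llm` is a hypothetical helper; swap in your LLM provider's client here.

from itertools import combinations
from typing import Callable, List

PROMPT = (
    "Domain: {domain}\n"
    "Query: {query}\n\n"
    "Document A:\n{doc_a}\n\n"
    "Document B:\n{doc_b}\n\n"
    "Which document answers the query better? Reply with exactly 'A' or 'B'."
)


def pairwise_rerank(
    query: str,
    docs: List[str],
    ask_llm: Callable[[str], str],
    domain: str = "",
    n_repeats: int = 3,          # repeated calls at non-zero temperature
) -> List[str]:
    wins = [0] * len(docs)
    # All pairs of candidates: on the order of n^2 / 2 comparisons,
    # each repeated n_repeats times, which drives runtime and cost.
    for i, j in combinations(range(len(docs)), 2):
        for _ in range(n_repeats):
            prompt = PROMPT.format(
                domain=domain, query=query, doc_a=docs[i], doc_b=docs[j]
            )
            verdict = ask_llm(prompt).strip().upper()
            if verdict.startswith("A"):
                wins[i] += 1
            elif verdict.startswith("B"):
                wins[j] += 1
    # Order documents by how often the LLM preferred them.
    order = sorted(range(len(docs)), key=lambda idx: wins[idx], reverse=True)
    return [docs[idx] for idx in order]
```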

Maximilian Schattauer

Technical Consultant | Data & AI @ Perelyn GmbH



