Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Andres Stroganoff.
I'd like to present our joint work with Andrea Bishan: fast confidence estimation for classification and regression models.
Let's talk about prediction confidence.
Why do we need it, and how do we define it?
Prediction quality estimation plays a key role in model tuning, and there are many metrics to do such estimation: for instance, the F1 score, various area-under-the-curve metrics, and so on.
The important thing is that during model tuning we usually have the testing dataset, so the true values are known to us, and we can compare the predicted values with them.
Sometimes the machine learning model is applied on a group basis.
For instance, we want to detect the infected students within a class.
In this case, we might be interested not only in how well the model performs on individual samples, but also on entire groups.
But what if we don't have the true values for validating the predictions?
Can we still estimate the prediction quality?
Well, strictly speaking, no.
But we can use indirect quantitative information about predictions to tell if they look reasonable or quite improbable.
As a motivating example for this problem, we'll consider a machine learning method for profile-guided optimization that we proposed in the following paper.
Let's give a brief introduction to this method.
Android applications consist of Java bytecode of classes and methods, which can be compiled into a native representation; this usually leads to faster application startup and smoother performance.
In profile-guided optimization, not all classes and methods are compiled, but only those which are frequently used.
They are called hot.
Their indexes are listed in a special file, a profile.
In our machine learning method for profile generation, a model predicts hot classes and methods and creates an optimization profile for the application.
There are cases when testing profiles are not available, so direct
prediction validation is impossible.
In this talk, we'll discuss a p-value-based method to indirectly estimate the prediction confidence in such cases, and its efficient C++ implementation.
So, which predictions do we trust?
Consider the following example table.
Here we see some user applications for Android.
For each application, we see the number of methods in this application, and in the last column we see the number of methods which are predicted to be hot.
How confident are we in these numbers?
Do almost 8,000 hot methods look okay for the Solitaire application with almost 300,000 methods? Does almost 10% of hot methods look reasonable for the SofaScore application?
We need a metric to measure the confidence level in such cases.
Our goals for this metric are as follows.
First, it should be easily interpretable: a value of one should mean that the result is very probable, and zero should mean that the result seems incorrect.
Again, we can't tell if predictions are correct if we don't have their true values, but we can say that, statistically, the number of positive classes or another quantitative value seems probable or improbable, so the confidence is a measure of this probability.
Our second goal is that the metric evaluation should be very time
efficient, and the third goal is that the implementation should be easily verifiable.
In our case, we needed a solution that won't require much testing to prevent stability and undefined-behavior issues.
To decide which metric to use, let's examine the real data distribution.
In the table, we see the numbers of methods in applications from the training set.
We also see the actual number of hot methods and the ratio of the number
of hot methods to the total number of methods for each application.
In the picture, we see the ratio distribution.
It is a histogram: we count the number of occurrences of ratio values within small segments.
We see that it can be roughly approximated with a normal distribution function, which is the blue line, but we could use more precise approximations: for instance, the black line is the cubic spline interpolation.
For a confidence metric, it is natural to reflect how well the measured value corresponds to the distribution.
And now let's define a metric.
First of all, let's recall what a probability distribution function is.
It is a function which describes how probabilities are assigned to the
possible outcomes of a random variable.
There is also a cumulative distribution function, or CDF, which gives the probability that the random variable is less than or equal to a certain value.
There are several useful properties of the cumulative distribution function: it is non-decreasing, it is bounded by zero and one, and it can be defined as the integral of the probability distribution function.
In the picture here, we see a CDF evaluated at 0.4.
It is the area under the curve from minus infinity to 0.4, and it is also the probability that a randomly picked point will be less than or equal to 0.4.
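For reference, this is the standard definition of the CDF in terms of the PDF:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$$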
We can use various probability distribution functions.
The most common is the normal distribution, which is shown on the left.
It is defined by two values.
The mean value, which is the center of horizontal symmetry, and the standard deviation value, which shows how dispersed the values are around the mean.
In the right picture, we see the cubic spline interpolation, normalized so that the area under the curve is approximately one, as for any probability distribution function.
Both of these approximations can be used for measuring confidence.
To define a confidence level, we will use a p-value-based method.
Let t be a value for which we measure confidence, f(x) be a probability distribution function, and F(x) be the cumulative distribution function, or CDF.
Recall that F(t) is the probability that a randomly picked f-distributed value is less than or equal to t, and 1 - F(t) is the probability that a random value is greater than t.
When these probabilities are significantly different, then t is shifted away from most of the values and should be considered unlikely to appear.
On the other hand, when the probability of finding the random value to the left of t is close to the probability of finding it to the right of t, then t is approximately in the middle of the distribution.
In this case, the confidence metric should give one.
Of course, this is arguable, because there are distributions for which this won't hold as a reasonable definition, but we won't discuss them now.
So let's define the confidence measure as the doubled minimal probability of finding the random value to the left or to the right of t.
When these probabilities are equal, we get a confidence level equal to one.
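Written out, the confidence measure just described is:

$$\mathrm{confidence}(t) = 2 \cdot \min\big(F(t),\, 1 - F(t)\big)$$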
So, how confident are we about the predictions?
The confidence here is evaluated using the normal distribution function
interpolation of our experimental data.
We see that the number of methods which are classified hot for the Connect application seems about right, but the number of hot methods in the Download application seems very suspicious due to the exponential behavior of the distribution's tail.
The confidence for the Solitaire application is still considerably high, although it depends on what threshold is chosen.
And now let's discuss the implementation of the confidence metric.
How do we approach the computation of F(x)? First of all, let's recall the goals.
First, create a metric which evaluates a confidence level as a
real number between zero and one where one is the most probable result
and zero is the most improbable.
We have done it using cumulative distribution function.
Second, metric evaluation should be time efficient, and third, the implementation should be easily verified.
Now let's talk about evaluating the confidence time efficiently.
To compute the cumulative distribution function F(x), we'll use numerical integration on an evenly distributed grid with step size delta.
The integration range is defined by the probability distribution function.
Let it be the segment from a to b; thus, we consider the probability distribution function to be negligible, close to zero, outside of this segment.
To get the approximate value of the confidence, we will store the values of the cumulative distribution function in a pre-computed lookup table.
It'll contain the values of the integral at the grid points.
The value at the first point is zero, and the value at the last point is approximately one, as it is the area under the curve of the probability distribution function.
The code below shows that acquiring the confidence value is done in constant time: we just compute the index of the appropriate point in the grid and return the doubled minimal value of the integral at this point.
Here we provide simplified code; our C++ implementation is more general, and it is available on GitHub.
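The slide code itself is not captured in this transcript; below is a minimal sketch of such a constant-time lookup, assuming a precomputed table of CDF values on an even grid over a segment [a, b]. All names and parameters here are illustrative, not the actual GitHub implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr double a = -1.0;       // left end of the integration segment (assumed)
constexpr double b = 1.0;        // right end of the integration segment (assumed)
constexpr std::size_t N = 1024;  // number of grid points (assumed)

// Confidence = 2 * min(F(t), 1 - F(t)), evaluated in O(1) via the lookup table.
double confidence(double t, const std::array<double, N>& cdf_table) {
    if (t <= a || t >= b) return 0.0;  // the PDF is negligible outside [a, b]
    const double delta = (b - a) / (N - 1);                           // grid step
    const std::size_t i = static_cast<std::size_t>((t - a) / delta);  // index of t
    return 2.0 * std::min(cdf_table[i], 1.0 - cdf_table[i]);
}
```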
Now let's discuss how to implement the lookup table with CDF values.
There are several approaches, each with its pros and cons.
First of all, we can generate a table when our application starts.
This is fairly simple, but since it is done at runtime, it may increase the application startup time, and it also requires additional unit testing.
Another approach is to generate the values of the CDF before the application is compiled and fill the array by hand: for instance, copy and paste the values computed in Python or somewhere else.
This is a very simple approach, but not flexible: if we decide to change some parameters of the probability distribution function, or the function itself, we will have to regenerate the values and update the table in the source code.
This is very painful.
The benefit of this approach is that the runtime code is very simple.
We only have the lookup function.
But in C++, we can utilize the generic programming approach, which allows us to run an algorithm during the compilation of a program, so we can generate the table not before or after the compilation, but during the compilation itself.
This allows us to add error checks to spot issues before the application is even started, so less unit testing is required.
Also, since the CDF table will be generated at compile time, the
runtime code will be as simple as we showed just to look up into a table.
The great thing about the generic programming approach is that we can plug in a custom distribution function, whether it is given by a formula or by a piecewise approximation.
The downside of this approach is that compile-time-evaluated code is usually a bit more complex and strict, but we'll have no problems with that, so we will use this approach.
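As a sketch of what this buys us (assuming a constexpr `make_cdf_table()` like the one shown a bit later), the checks become compile-time assertions:

```cpp
// The table is computed during compilation, so errors surface as compile errors.
constexpr auto cdf_table = make_cdf_table();

// Compile-time sanity checks instead of extra runtime unit tests:
static_assert(cdf_table.front() == 0.0, "CDF must start at zero");
static_assert(cdf_table.back() > 0.99 && cdf_table.back() < 1.01,
              "CDF must integrate to approximately one");
```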
And now we will implement the lookup table generation and normal
distribution function evaluation.
First of all, the cumulative distribution function requires computation of an integral.
We will use the trapezoid rule approximation.
We split the integration segment from a to b into consecutive segments, where the area under the curve on each segment is approximated by the area of a trapezoid, which is computed from the lengths of its parallel sides.
In the picture, we see an example of computing one such area.
The lengths of the sides are given by the values of the function at points x sub i and x sub i plus one, and the area is the mean of these lengths multiplied by the distance between the sides.
The integral is approximated as the sum of the areas of all trapezoids by formula (1).
Note that each inner point is contained in exactly two trapezoids; that's why the values at the inner points are not divided by two in the sum.
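A plausible reconstruction of formula (1) from this description, with grid points x_0 = a, ..., x_n = b and step Δ:

$$\int_a^b f(x)\,dx \approx \Delta \left( \frac{f(x_0) + f(x_n)}{2} + \sum_{i=1}^{n-1} f(x_i) \right)$$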
Computation of the CDF table is fairly simple.
Here is our trapezoid approximation of the integral, and below is the code which computes the values of CDF[i], evaluating the probability distribution function once for each grid point.
The first value of the CDF is defined as zero.
In the variable sum, we store the sum of the PDF values at the inner points up to the current point, plus half of the PDF value at the first point.
Thus, to compute the CDF at the current point, we take the sum variable, add to it half of the PDF value at the current point, and multiply by delta.
In the last line, we just update the sum variable.
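The slide code is not in the transcript; a minimal constexpr sketch matching this description might look as follows. The segment, grid size, and the placeholder uniform PDF are assumptions for illustration only.

```cpp
#include <array>
#include <cstddef>

constexpr double a = 0.0;        // integration segment [a, b] (assumed)
constexpr double b = 1.0;
constexpr std::size_t N = 1024;  // number of grid points (assumed)
constexpr double delta = (b - a) / (N - 1);

// Placeholder PDF: uniform on [0, 1]; the talk plugs in the normal distribution.
constexpr double pdf(double /*x*/) { return 1.0; }

constexpr std::array<double, N> make_cdf_table() {
    std::array<double, N> cdf{};
    cdf[0] = 0.0;               // the integral over an empty segment is zero
    double sum = pdf(a) / 2.0;  // half of the PDF value at the first point
    for (std::size_t i = 1; i < N; ++i) {
        const double x = a + i * delta;
        cdf[i] = (sum + pdf(x) / 2.0) * delta;  // trapezoid rule up to x_i
        sum += pdf(x);          // inner points enter the sum undivided
    }
    return cdf;
}
```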
And now let's see how the probability distribution function for the normal distribution can be computed at compile time.
It's a bit tricky.
The function is given by formula (2).
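Formula (2) itself is not captured in the transcript; it is presumably the standard normal density with mean μ and standard deviation σ:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$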
We have no problems with computing the root of two pi, because we just substitute its value as a constant.
But in C++17, which is a popular standard of C++, the exponent function is not a constant expression, and it cannot be used in compile-time computations.
So we'll make our own compile-time-friendly exponent function.
Let's recall the Taylor expansion of e to the x, given in formula (3).
In the code below, we see a straightforward implementation of this summation.
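Formula (3) is the standard Taylor series:

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$$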
For positive values of x, it adds terms one by one to a variable e until the next term is approximately zero.
As n increases, the factorial will eventually scale any term down to zero, but some terms may first become quite large and overflow, which would corrupt the sum.
Thus, we will use this algorithm for values of x less than one, and for greater values we will use another approach.
When the absolute value of x is greater than one, we will split it into a sum of two values: an integer part, a variable u, and a fractional part, a variable v.
Since v is less than one, we already know how to compute e to the v.
And for e to the u, we'll use the binary exponentiation (binpow) algorithm, which computes an integer power of a value in logarithmic time.
The algorithm for computing e to the u is given below.
So, to compute e to the x for arbitrary x, we'll use a combination of these two algorithms and formula (4).
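The slide code is not captured here; formula (4) is presumably just the split e^x = e^u · e^v. A compile-time sketch of the combined scheme, with illustrative function names, could look like this:

```cpp
// Taylor series for |x| < 1: adds terms until they vanish (no overflow risk here).
constexpr double exp_small(double x) {
    double term = 1.0, sum = 1.0;
    for (int n = 1; term > 1e-16 || term < -1e-16; ++n) {
        term *= x / n;   // x^n / n! computed incrementally
        sum += term;
    }
    return sum;
}

// Binary exponentiation (binpow): base^n in O(log n) multiplications.
constexpr double pow_int(double base, unsigned n) {
    double result = 1.0;
    while (n > 0) {
        if (n & 1) result *= base;
        base *= base;
        n >>= 1;
    }
    return result;
}

// Formula (4): e^x = e^u * e^v, where u is the integer part and v the fractional.
constexpr double exp_ct(double x) {
    if (x < 0.0) return 1.0 / exp_ct(-x);        // handle negative arguments
    const unsigned u = static_cast<unsigned>(x); // integer part
    const double v = x - u;                      // fractional part, 0 <= v < 1
    constexpr double e = 2.718281828459045;      // e^1
    return pow_int(e, u) * exp_small(v);
}

static_assert(exp_ct(1.0) > 2.7182 && exp_ct(1.0) < 2.7183, "sanity check");
```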
All the discussed algorithms can be evaluated at compile time, and the full source code is provided on GitHub.
And to sum up: we discussed a confidence estimation approach for classification and regression models based on a p-value score.
We provided an implementation of the confidence evaluation
with constant time complexity.
The approach is general and can be applied to various data distributions, including
the approximations of experimental data.
I thank you for your attention.
Have a nice day.