Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Andres Stroganoff.
I'd like to present our joint work with Andrea Bishan: fast confidence estimation for classification and regression models.
Let's talk about prediction confidence.
Why do we need it, and how do we define it?
Prediction quality estimation plays a key role in model tuning, and there are many metrics to do such estimation: for instance, the F1 score, various area-under-the-curve metrics, and so on.
The important thing is that during model tuning we usually have the testing dataset, so the true values are known to us, and we can compare the predicted values with them.
Sometimes the machine learning model is applied on a group basis.
For instance, we want to detect the infected students within a class.
In this case, we might be interested not only in how well the model performs on individual samples, but also on entire groups.
But what if we don't have the true values for validating the predictions?
Can we still estimate the prediction quality?
Well, strictly speaking, no.
But we can use indirect quantitative information about predictions to tell if they look reasonable or quite improbable.
As a motivating example for this problem, we'll consider a machine learning method for profile-guided optimization that we proposed in the following paper.
Let's give a brief introduction to this method.
Android applications consist of Java bytecode of classes and methods, which can be compiled into a native representation; this usually leads to faster application startup and smoother performance.
In profile-guided optimization, not all classes and methods are compiled, but only those which are frequently used.
They are called hot.
Their indexes are listed in a special file, a profile.
In our machine learning method for profile generation, a model predicts hot classes and methods and creates an optimization profile for the application.
There are cases when testing profiles are not available, so direct
prediction validation is impossible.
In this talk, we'll discuss a p-value-based method to indirectly estimate the prediction confidence in such cases, and its efficient C++ implementation.
So, which predictions do we trust?
Consider the following example table.
Here we see some user applications for Android.
For each application, we see the number of methods in this application, and in the last column we see the number of methods which are predicted to be hot.
How confident are we in these numbers?
Do almost 8,000 hot methods look okay for the Solitaire application with almost 300,000 methods? Does almost 10% of hot methods look reasonable for the SofaScore application?
We need a metric to measure the confidence level in such cases.
Our goals for this metric are as follows.
First, it should be easily interpretable: a value of one should mean that the result is very probable, and zero should mean that the result seems incorrect.
Again, we can't tell if predictions are correct if we don't have their true values, but we can say that, statistically, the number of positive classes or another quantitative value seems probable or improbable, so the confidence is a measure of this probability.
Our second goal is that the metric evaluation should be very time
efficient, and the third goal is that the implementation should be easily verifiable.
In our case, we needed a solution that won't require much testing to prevent stability and undefined-behavior issues.
To decide which metric to use, let's examine the real data distribution.
In the table, we see the numbers of methods in applications from the training set.
We also see the actual number of hot methods and the ratio of the number
of hot methods to the total number of methods for each application.
In the picture, we see the ratio distribution.
It is a histogram: we count the number of occurrences of ratio values within small segments.
We see that it can be roughly approximated with a normal distribution function, which is the blue line, but we could use more precise approximations: for instance, the black line is the cubic spline interpolation.
For a confidence metric, it is natural to reflect how well the measured value corresponds to the distribution.
And now let's define a metric.
First of all, let's recall what a probability distribution function is.
It is a function which describes how probabilities are assigned to the
possible outcomes of a random variable.
There is also a cumulative distribution function, or CDF, which gives the probability that the random variable is less than or equal to a certain value.
There are several useful properties of the cumulative distribution function: it is non-decreasing, it is bounded by zero and one, and it can be defined as the integral of the probability distribution function.
In the picture here, we see a CDF evaluated at 0.4.
It is the area under the curve from minus infinity to 0.4, and it is also the probability that a randomly picked point will be less than or equal to 0.4.
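For reference, this is the standard definition of the CDF in terms of the PDF:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$$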
We can use various probability distribution functions.
The most common is the normal distribution, which is shown on the left.
It is defined by two values.
The mean value, which is the center of horizontal symmetry, and the standard deviation value, which shows how dispersed the values are around the mean.
In the right picture, we see the cubic spline interpolation, normalized so that the area under the curve is approximately one, as for any probability distribution function.
Both of these approximations can be used for measuring confidence.
To define a confidence level, we will use a p-value-based method.
Let t be a value for which we measure confidence, f(x) be a probability distribution function, and F(x) be the cumulative distribution function, or CDF.
Recall that F(t) is the probability that a randomly picked f-distributed value is less than or equal to t, and 1 - F(t) is the probability that a random value is greater than t.
When these probabilities are significantly different, then t is shifted away from most of the values and should be considered unlikely to appear.
On the other hand, when the probability of finding the random value to the left of t is close to the probability of finding it to the right of t, then t is approximately in the middle of the distribution.
In this case, the confidence metric should give one.
Of course, this is arguable, because there are distributions for which this won't hold as a reasonable definition, but we won't discuss them now.
So let's define the confidence measure as the doubled minimal probability of finding the random value to the left or to the right of t.
When these probabilities are equal, we get a confidence level equal to one.
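Written out, the confidence measure just described is:

$$\mathrm{confidence}(t) = 2 \cdot \min\big(F(t),\, 1 - F(t)\big)$$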
So, how confident are we about the predictions?
The confidence here is evaluated using the normal distribution function
interpolation of our experimental data.
We see that the number of methods which are classified hot for the Connect application seems about right, but the number of hot methods in the Download application seems very suspicious due to the exponential behavior of the distribution's tail.
The confidence for the Solitaire application is still considerably high, although it depends on what threshold is chosen.
And now let's discuss the implementation of the confidence metric.
How do we approach the computation of F(x)? First of all, let's recall the goals.
First, create a metric which evaluates a confidence level as a
real number between zero and one where one is the most probable result
and zero is the most improbable.
We have done it using cumulative distribution function.
Second, metric evaluation should be time efficient, and third, the implementation should be easily verified.
Now let's talk about evaluating the confidence time efficiently.
To compute the cumulative distribution function F(x), we'll use numerical integration on an evenly distributed grid with step size delta.
The integration range is defined by the probability distribution function.
Let it be the segment from a to b; thus, we consider the probability distribution function to be negligible, close to zero, outside of this segment.
To get the approximate value of the confidence, we will store the values of the cumulative distribution function in a pre-computed lookup table.
It'll contain the values of the integral at the grid points.
The value at the first point is zero, and the value at the last point is approximately one, as it is the area under the curve of the probability distribution function.
The code below shows that acquiring the confidence value is done in constant time: we just compute the index of the appropriate point in the grid and return the doubled minimal value of the integral at this point.
Here we provide simplified code; our C++ implementation is more general, and it is available on GitHub.
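The slide code itself is not captured in this transcript; below is a minimal sketch of such a constant-time lookup, assuming a precomputed table of CDF values on an even grid over a segment [a, b]. All names and parameters here are illustrative, not the actual GitHub implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr double a = -1.0;       // left end of the integration segment (assumed)
constexpr double b = 1.0;        // right end of the integration segment (assumed)
constexpr std::size_t N = 1024;  // number of grid points (assumed)

// Confidence = 2 * min(F(t), 1 - F(t)), evaluated in O(1) via the lookup table.
double confidence(double t, const std::array<double, N>& cdf_table) {
    if (t <= a || t >= b) return 0.0;  // the PDF is negligible outside [a, b]
    const double delta = (b - a) / (N - 1);                           // grid step
    const std::size_t i = static_cast<std::size_t>((t - a) / delta);  // index of t
    return 2.0 * std::min(cdf_table[i], 1.0 - cdf_table[i]);
}
```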
Now let's discuss how to implement the lookup table with CDF values.
There are several approaches, each with its pros and cons.
First of all, we can generate a table when our application starts.
This is fairly simple, but since it is done at runtime, it may increase the application startup time, and it also requires additional unit testing.
Another approach is to generate the values of the CDF before the application is compiled and fill the array by hand: for instance, copy and paste the values computed in Python or somewhere else.
This is a very simple approach, but not flexible: if we decide to change some parameters of the probability distribution function, or the function itself, we will have to regenerate the values and update the table in the source code.
This is very painful.
The benefit of this approach is that the runtime code is very simple.
We only have the lookup function.
But in C++, we can utilize the generic programming approach, which allows us to run an algorithm during the compilation of a program, so we can generate the table not before or after the compilation, but during the compilation itself.
This allows us to add error checks to spot issues before the application is even started, so less unit testing is required.
Also, since the CDF table will be generated at compile time, the
runtime code will be as simple as we showed just to look up into a table.
The great thing about the generic programming approach is that we can plug in a custom distribution function, whether it is given by a formula or by a piecewise approximation.
The downside of this approach is that compile-time-evaluated code is usually a bit more complex and strict, but we'll have no problems with that, so we will use this approach.
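As a sketch of what this buys us (assuming a constexpr `make_cdf_table()` like the one shown a bit later), the checks become compile-time assertions:

```cpp
// The table is computed during compilation, so errors surface as compile errors.
constexpr auto cdf_table = make_cdf_table();

// Compile-time sanity checks instead of extra runtime unit tests:
static_assert(cdf_table.front() == 0.0, "CDF must start at zero");
static_assert(cdf_table.back() > 0.99 && cdf_table.back() < 1.01,
              "CDF must integrate to approximately one");
```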
And now we will implement the lookup table generation and normal
distribution function evaluation.
First of all, the cumulative distribution function requires computation of an integral.
We will use the trapezoid rule approximation.
We split the integration segment from a to b into consecutive segments, where the area under the curve on each segment is approximated by the area of a trapezoid, which is computed from the lengths of its parallel sides.
In the picture, we see an example of computing one such area.
The lengths of the sides are given by the values of the function at points x sub i and x sub i plus one, and the area is the mean of these lengths multiplied by the distance between the sides.
The integral is approximated as the sum of the areas of all trapezoids by formula (1).
Note that each inner point is contained in exactly two trapezoids; that's why the values at the inner points are not divided by two in the sum.
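A plausible reconstruction of formula (1) from this description, with grid points x_0 = a, ..., x_n = b and step Δ:

$$\int_a^b f(x)\,dx \approx \Delta \left( \frac{f(x_0) + f(x_n)}{2} + \sum_{i=1}^{n-1} f(x_i) \right)$$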
Computation of the CDF table is fairly simple.
Here is our trapezoid approximation of the integral, and below is the code which computes the values of CDF[i], evaluating the probability distribution function once for each grid point.
The first value of the CDF is defined as zero.
In the variable sum, we store the sum of the PDF values at the inner points up to the current point, plus half of the PDF value at the first point.
Thus, to compute the CDF at the current point, we take the sum variable, add to it half of the PDF value at the current point, and multiply by delta.
In the last line, we just update the sum variable.
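The slide code is not in the transcript; a minimal constexpr sketch matching this description might look as follows. The segment, grid size, and the placeholder uniform PDF are assumptions for illustration only.

```cpp
#include <array>
#include <cstddef>

constexpr double a = 0.0;        // integration segment [a, b] (assumed)
constexpr double b = 1.0;
constexpr std::size_t N = 1024;  // number of grid points (assumed)
constexpr double delta = (b - a) / (N - 1);

// Placeholder PDF: uniform on [0, 1]; the talk plugs in the normal distribution.
constexpr double pdf(double /*x*/) { return 1.0; }

constexpr std::array<double, N> make_cdf_table() {
    std::array<double, N> cdf{};
    cdf[0] = 0.0;               // the integral over an empty segment is zero
    double sum = pdf(a) / 2.0;  // half of the PDF value at the first point
    for (std::size_t i = 1; i < N; ++i) {
        const double x = a + i * delta;
        cdf[i] = (sum + pdf(x) / 2.0) * delta;  // trapezoid rule up to x_i
        sum += pdf(x);          // inner points enter the sum undivided
    }
    return cdf;
}
```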
And now let's see how the probability distribution function for the normal distribution can be computed at compile time.
It's a bit tricky.
The function is given by formula (2).
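Formula (2) itself is not captured in the transcript; it is presumably the standard normal density with mean μ and standard deviation σ:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$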
We have no problems with computing the root of two pi, because we just substitute its value as a constant.
But in C++17, which is a popular standard of C++, the exponent function is not a constant expression, and it cannot be used in compile-time computations.
So we'll make our own compile-time-friendly exponent function.
Let's recall the Taylor expansion of e to the x, given in formula (3).
In the code below, we see a straightforward implementation of this summation.
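Formula (3) is the standard Taylor series:

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$$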
For positive values of x, it adds terms one by one to a variable e until the next term is approximately zero.
As n increases, the factorial will eventually scale any term down to zero, but some terms may first become quite large and overflow, which would corrupt the sum.
Thus, we will use this algorithm for values of x less than one, and for greater values we will use another approach.
When the absolute value of x is greater than one, we will split it into a sum of two values: an integer part, a variable u, and a fractional part, a variable v.
Since v is less than one, we already know how to compute e to the v.
And for e to the u, we'll use the binary exponentiation (binpow) algorithm, which computes an integer power of a value in logarithmic time.
The algorithm for computing e to the u is given below.
So, to compute e to the x for arbitrary x, we'll use a combination of these two algorithms and formula (4).
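The slide code is not captured here; formula (4) is presumably just the split e^x = e^u · e^v. A compile-time sketch of the combined scheme, with illustrative function names, could look like this:

```cpp
// Taylor series for |x| < 1: adds terms until they vanish (no overflow risk here).
constexpr double exp_small(double x) {
    double term = 1.0, sum = 1.0;
    for (int n = 1; term > 1e-16 || term < -1e-16; ++n) {
        term *= x / n;   // x^n / n! computed incrementally
        sum += term;
    }
    return sum;
}

// Binary exponentiation (binpow): base^n in O(log n) multiplications.
constexpr double pow_int(double base, unsigned n) {
    double result = 1.0;
    while (n > 0) {
        if (n & 1) result *= base;
        base *= base;
        n >>= 1;
    }
    return result;
}

// Formula (4): e^x = e^u * e^v, where u is the integer part and v the fractional.
constexpr double exp_ct(double x) {
    if (x < 0.0) return 1.0 / exp_ct(-x);        // handle negative arguments
    const unsigned u = static_cast<unsigned>(x); // integer part
    const double v = x - u;                      // fractional part, 0 <= v < 1
    constexpr double e = 2.718281828459045;      // e^1
    return pow_int(e, u) * exp_small(v);
}

static_assert(exp_ct(1.0) > 2.7182 && exp_ct(1.0) < 2.7183, "sanity check");
```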
All the discussed algorithms can be evaluated at compile time, and the full source code is provided on GitHub.
And to sum up: we discussed a confidence estimation approach for classification and regression models based on a p-value score.
We provided an implementation of the confidence evaluation
with constant time complexity.
The approach is general and can be applied to various data distributions, including
the approximations of experimental data.
I thank you for your attention.
Have a nice day.