Transcript
In this presentation, we'll be deep diving into probabilistic programming, which is a powerful modeling approach that addresses critical challenges in traditional machine learning and AI techniques. We'll explore how probabilistic programming enables us to embrace uncertainty, incorporate expert knowledge, and enhance transparency in decision making. Finally, I will present the implementation of probabilistic models in Python.
            
            
            
Why do we need probabilistic programming, and what does it have to offer over other machine learning and AI techniques? The fundamental challenge in conventional machine learning and AI techniques is the lack of uncertainty quantification. These models typically provide point estimates without accounting for the uncertainty surrounding their predictions. This limitation hampers our ability to assess the reliability of a model and undermines our confidence in the decision-making process.

The second challenge we face is that machine learning models are data hungry: they often require correctly labeled data, and they tend to struggle with problems where data is limited. Conventional machine learning and AI techniques also lack a framework to encode expert domain knowledge or prior beliefs into the model. Without the ability to leverage domain-specific insights, the model might overlook crucial nuances in the data and tend not to perform up to its potential.

Lastly, machine learning models are becoming more and more complex and opaque, while the public demands more transparency and accountability for decisions derived from data and AI. So all of this presents a need for a modeling framework that can encode expert knowledge, work with limited data, provide predictions along with their associated uncertainty, and enable models which offer more transparency and explainability. Probabilistic programming emerges as a game changer.
            
            
            
So, to understand probabilistic programming, it is essential to grasp Bayesian statistics and how it differs from the classical frequentist approach. In frequentist statistics, model parameters are treated as fixed quantities, and uncertainty in the parameter estimation is typically addressed through techniques such as confidence intervals. However, frequentist methods do not assign probability distributions to parameters, and their interpretation of uncertainty is rooted in the long-run frequency properties of the estimators rather than explicit probabilistic statements about the parameter values. In contrast, in Bayesian statistics, unknown model parameters are treated as random variables and are modeled using probability distributions. This approach inherently captures uncertainty within the parameters themselves, and hence this framework offers a more intuitive and flexible approach to quantifying uncertainty. How does Bayesian statistics work?
            
            
            
Bayesian statistical methods use Bayes' theorem to compute and update probabilities as you obtain new data. This is a simple but powerful equation. What we start with is the prior belief: the prior distribution for the unknown parameter. The likelihood represents the new information from the observed data. The posterior represents our updated belief about this unknown parameter, which incorporates both prior knowledge and observed evidence. The term in the denominator, the marginal likelihood, is more of a normalizing constant, making sure that the posterior also represents a probability distribution.
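The slide's equation isn't reproduced in the transcript, but what's being described is Bayes' theorem:

```latex
% posterior = (likelihood x prior) / marginal likelihood
P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}
```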
            
            
            
Now let's look at how inference happens with Bayesian versus non-Bayesian models. We'll start with non-Bayesian inference, and then we'll go to Bayesian inference. In the case of non-Bayesian inference, what we do is determine a point estimate of the unknown parameter which maximizes the likelihood of the data, where the likelihood is the probability of the data given the unknown parameter. So we find the parameter value which maximizes the likelihood of the evidence, and it comes out as a single point estimate. And for a new instance, we predict using only that point estimate. In the case of Bayesian inference, by contrast, we start with our prior belief about this unknown parameter, which here is represented as p(theta), and then we compute the posterior distribution, which is p(theta given evidence). It's an updated distribution over the unknown parameter, starting from our prior and given the new data set. So now, for a new instance, you compute the probability of the new instance considering the entire posterior distribution rather than a single point estimate.
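Reconstructing the slide's equations from this description, the contrast is roughly:

```latex
% Non-Bayesian (maximum likelihood): a single point estimate
\hat{\theta} = \arg\max_{\theta} P(y \mid \theta),
\qquad P(y^{*}) \approx P(y^{*} \mid \hat{\theta})
% Bayesian: integrate predictions over the entire posterior
P(y^{*} \mid y) = \int P(y^{*} \mid \theta)\, P(\theta \mid y)\, d\theta
```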
            
            
            
This simple formulation is a lot more complex in practice. The integral here tends to be intractable, especially when we work with a higher-dimensional parameter space, and there's no closed-form solution for the posterior distribution. So what do we do in that scenario? If we can't get a closed-form solution, can we get samples from the posterior distribution? If we are able to sample from this posterior distribution, we effectively have the posterior distribution. So the whole idea is that if we can sample from this posterior, we can use those samples to get inference for a new instance along with the associated uncertainty.
            
            
            
As we touched on earlier, p(y) is the normalizing constant. This normalizing constant involves an integral that is generally intractable, so we don't really have a closed-form solution.
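In symbols (reconstructed from the description):

```latex
% The marginal likelihood integrates over the whole parameter space,
% which has no closed form in general.
P(y) = \int P(y \mid \theta)\, P(\theta)\, d\theta
```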
            
            
            
Numerical integration techniques also tend to be too computationally intensive here. So how do we sample from the posterior?
            
            
            
For this, we rely on a special class of algorithms called Markov chain Monte Carlo (MCMC) methods, through which we are able to sample from a probability distribution. If we're able to construct a Markov chain that has the desired distribution as its equilibrium distribution, we can obtain samples from the desired distribution by recording states from this Markov chain. There are different MCMC samplers here, such as Metropolis-Hastings sampling and so on, which can help generate samples from this distribution.
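To make the idea concrete, here is a minimal sketch of a Metropolis sampler (not code from the talk), using a standard normal as a stand-in for an unnormalized posterior:

```python
import numpy as np

# A minimal Metropolis sampler: draw samples from a distribution known
# only up to a normalizing constant.
def log_target(theta):
    return -0.5 * theta ** 2  # log of an unnormalized N(0, 1) density

def metropolis(n_samples, step_size=1.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    theta = 0.0  # arbitrary initial state of the Markov chain
    for i in range(n_samples):
        proposal = theta + step_size * rng.normal()  # symmetric proposal
        # Accept with probability min(1, target(proposal) / target(theta));
        # the unknown normalizing constant cancels in this ratio.
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples[i] = theta  # record the current state of the chain
    return samples

draws = metropolis(10_000)
print(draws.mean(), draws.std())  # should be close to 0 and 1
```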
            
            
            
So now, going back to explaining what probabilistic programming, or probabilistic modeling, is: probabilistic programming is merely a programming framework for Bayesian statistics. It inherently captures uncertainty within its parameters, so it tends to thrive in a world of uncertainty. And as you define your prior distributions, you build your prior beliefs, or expert domain knowledge, into your model, so it tends to work well with little data as well. And it can be updated: your distribution can be updated as you get more and more new information. And the whole model architecture offers transparency and more explainable models.
            
            
            
Now, a bit about the workflow of probabilistic programming. The first step is to identify all the unknown parameters. We define the prior distribution, and while defining the prior distribution, we encode our prior belief, our expert domain knowledge, about the model parameters. Then we specify the likelihood, which is the probability distribution of the observed data as a function of the unknown quantities. And then we can run a suitable MCMC sampler to get the posterior distribution for all of these unknown parameters. Now, for any new instance, instead of point estimates, we have the entire distribution for the unknown parameters, and we can utilize that distribution to compute the estimate along with its uncertainty for the new instance.
            
            
            
So now, a quick demo on implementing Bayesian models, or probabilistic models, in Python. For my demo, I'm using a data set in which, from a sample of a population, I have height, weight, and gender. Gender here is in binary form: whether the candidate is female or not. And then you have the height and weight of that candidate. In a non-probabilistic world, we would try to fit a logistic regression model here, and for the Bayesian model I'll be using Stan. So we start with a simple logistic model with the is-female flag as the target, and then we try to find the coefficients of height and weight which best fit our problem.
            
            
            
Right? So in this case, you run a linear model, a logistic regression model, and then you get the coefficients of that logistic regression model. And again, you're only getting point estimates; you are not getting an estimate of the range of each parameter or the underlying uncertainty associated with the model coefficients, right?
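The notebook itself isn't included in the transcript, but this non-Bayesian baseline would look something like the following (file and column names are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Non-Bayesian baseline: an ordinary logistic regression, which
# returns point estimates only.
df = pd.read_csv("heights_weights.csv")  # hypothetical dataset
X = df[["height", "weight"]]
y = df["is_female"]

clf = LogisticRegression().fit(X, y)

# One number per coefficient: no distribution, no uncertainty.
print("intercept:", clf.intercept_)
print("coefficients:", clf.coef_)
```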
            
            
            
So the next thing I'm going to do is move on to running a Bayesian model. For the Bayesian model, as we discussed earlier, we start with defining the unknown parameters, we define the prior distribution, we define the likelihood, and then we run an MCMC sampler. Stan has its own language, and the first thing I have done is build a Stan model, so I'll just give a quick glimpse of that model.
            
            
            
Stan has a handful of blocks: data, transformed data, parameters, transformed parameters, model, and generated quantities. Since this is a simple model, I'm just using the data, parameters, and model blocks here. The data block is where you define the structure of your data and the data types; the parameters block is where you define the data types of your unknown parameters. And then we come to the model part. In the model part, the first thing I do is start with my priors: the prior beliefs I hold about the three coefficients, your intercept, your coefficient for weight, and your coefficient for height. And then here is the part where I have defined the likelihood. The target metric here is binary, so we fit a Bernoulli logit likelihood here. That's it; that's how you define your Stan model.
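The Stan file itself isn't shown in the transcript; a minimal sketch of a model along the lines described (with assumed variable names, in a hypothetical logistic.stan) might be:

```stan
// data block: structure and types of the observed data
data {
  int<lower=0> N;                            // number of observations
  vector[N] height;
  vector[N] weight;
  array[N] int<lower=0, upper=1> is_female;  // binary target
}
// parameters block: types of the unknown parameters
parameters {
  real alpha;     // intercept
  real b_height;  // coefficient for height
  real b_weight;  // coefficient for weight
}
// model block: priors plus the Bernoulli-logit likelihood
model {
  alpha ~ normal(0, 10);
  b_height ~ normal(0, 10);
  b_weight ~ normal(0, 10);
  is_female ~ bernoulli_logit(alpha + b_height * height + b_weight * weight);
}
```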
            
            
            
Then I can take this model into Python and run the compiler for this model. The data which needs to be fed to Stan needs to be in the form of a dictionary, so the data is just being transformed. We run the MCMC sampler, and then you get a series of estimates, or samples, for each of the unknown parameters. You can then look at the mean, median, mode, and other centrality measures for your parameters, and along with that you can also understand the standard deviation and the range. So this gives you a lot more information about your coefficients than point estimates would, right?
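The talk doesn't name the Python interface; a sketch using CmdStanPy (PyStan would be similar), with the same assumed file and column names as above:

```python
import pandas as pd
from cmdstanpy import CmdStanModel

# Compile the Stan program sketched above (file name is an assumption).
model = CmdStanModel(stan_file="logistic.stan")

# Stan expects the data as a dictionary keyed by the names declared
# in the model's data block.
df = pd.read_csv("heights_weights.csv")  # hypothetical dataset
data = {
    "N": len(df),
    "height": df["height"].to_numpy(),
    "weight": df["weight"].to_numpy(),
    "is_female": df["is_female"].astype(int).to_numpy(),
}

# Run the MCMC sampler to draw from the posterior.
fit = model.sample(data=data, chains=4, iter_sampling=1000)

# Per-parameter means, standard deviations, and quantiles.
print(fit.summary())
```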
            
            
            
And then subsequently, as I perform predictions, instead of considering just a point estimate, I can consider the entire distribution of my parameters, my unknown quantities, my coefficients here. That can help us get a better prediction and, along with that, get the associated uncertainty of the prediction as well.
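Continuing the sketch above, prediction with the full posterior might average the predicted probability over every posterior draw (again, the parameter names follow the assumed model):

```python
import numpy as np

# Predict using the whole posterior rather than a point estimate:
# average the predicted probability over all posterior draws.
draws = fit.draws_pd()  # one row per posterior sample

def predict_is_female(height, weight):
    logits = (draws["alpha"]
              + draws["b_height"] * height
              + draws["b_weight"] * weight)
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, per draw
    # Posterior mean prediction plus the spread as its uncertainty.
    return probs.mean(), probs.std()

p, sd = predict_is_female(height=165.0, weight=60.0)
print(f"P(is_female) = {p:.2f} +/- {sd:.2f}")
```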
            
            
            
So that's about it. If you go to the Stan website, it has a lot of detailed documentation, and you can find out how to build more complex Stan models as well. That's it for my presentation. Thank you, everyone.