# A Beginner’s Guide to Bayesian Inference

*This article was published as a part of the Data Science Blogathon.*

## Introduction

Statistics helps us quantify uncertainty, and the concept of ‘Probability’ was introduced as the way to measure it.

There are 3 different approaches available to determine the probability of an event.

**• Classical**

**• Frequentist**

**• Bayesian**

Let’s understand the differences among these 3 approaches with the help of a simple example.

Suppose we’re rolling a fair six-sided die and we want to know the probability that the die shows a four. Under the Classical framework, all the possible outcomes are equally likely i.e., they have equal probabilities or chances. Here there are six possible outcomes and they are all equally likely, so the probability of a four on a fair six-sided die is just 1/6. This Classical approach works well when we have well-defined, equally likely outcomes. But when the outcomes are not clearly equally likely, or the question becomes subjective, it is much harder to apply.

On the other hand, the Frequentist definition requires us to imagine a hypothetical infinite sequence of repetitions of an event and then to look at the relative frequency of the outcome in that sequence. In the case of rolling a fair six-sided die, if we roll it an infinite number of times then 1/6th of the time we will get a four, and hence the probability of rolling a four with a six-sided die is 1/6 under the frequentist definition as well.
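The frequentist idea can be illustrated with a small simulation. The following is only a sketch: the function name `relative_frequency_of_four`, the seed and the roll counts are illustrative choices, and a finite simulation can only approximate the hypothetical infinite sequence.

```python
import random

def relative_frequency_of_four(num_rolls, seed=0):
    """Simulate rolls of a fair six-sided die and return the share of fours."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    fours = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == 4)
    return fours / num_rolls

# The relative frequency should settle near 1/6 (about 0.1667) as the
# number of rolls grows.
for n in (100, 10_000, 100_000):
    print(n, round(relative_frequency_of_four(n), 4))
```

With only 100 rolls the relative frequency can be noticeably off, while larger runs tend to sit much closer to 1/6.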

Now suppose we go a little further and ask whether our die is fair or not. Under the frequentist paradigm, the probability that the die is fair is either one (if it is a fair die) or zero (if it is not), because the frequentist approach measures everything from a physical perspective: the die either is fair or it is not. We cannot assign an intermediate probability to the fairness of the die. Frequentists are very objective in how they define probabilities, but their approach cannot give intuitive answers to some of the deeper subjective questions.

Here comes the advantage of the Bayesian approach.

The Bayesian perspective allows us to incorporate personal belief/opinion into the decision-making process. It takes into account what we already know about a particular problem even before we see any empirical evidence. Here we also have to acknowledge that my personal belief about a certain event may differ from someone else’s, and hence the outcome that we get using the Bayesian approach may also differ.

For example, I may say that there is a 90% probability that it will rain tomorrow, whereas my friend may say there is a 60% chance that it will rain tomorrow. So the Bayesian perspective is an inherently subjective approach to probability, but it often gives more intuitive results than the Frequentist approach while remaining mathematically rigorous. Let’s discuss this in detail in the following sections.

## What is Bayes’ Theorem?

Simplistically, Bayes’ theorem can be expressed through the following mathematical equation:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where A is an event and B is evidence. Here P(A) is the prior probability of event A, P(B|A) is the likelihood of the evidence B given A, and P(A|B) is the posterior probability of A given the evidence. The denominator P(B) is the overall probability of the evidence and acts as a normalizing constant. So, Bayes’ Theorem gives us the probability of an event based on our prior knowledge of the conditions that might be related to the event, and updates that probability when some new information or evidence comes up.

Now let’s focus on the 3 components of the Bayes’ theorem

**• Prior**

**• Likelihood**

**• Posterior**

• *Prior Distribution* – This is the key factor in Bayesian inference which allows us to incorporate our personal beliefs or own judgements into the decision-making process through a mathematical representation. Mathematically speaking, to express our beliefs about an unknown parameter θ we choose a distribution function called the prior distribution. This distribution is chosen before we see any data or run any experiment.

**How do we choose a prior?** Theoretically, we define a cumulative distribution function for the unknown parameter θ. In the basic setting, events with a prior probability of zero will have a posterior probability of zero, and events with a prior probability of one will have a posterior probability of one, whatever the data say. Hence, a good Bayesian framework will not assign a prior probability of exactly 0 or 1 to any event unless it has already occurred or is already known not to occur. A handy and widely used technique for choosing priors is to use a family of distribution functions that is flexible enough for some member of the family to represent our beliefs. Now let’s understand this concept a little better.

i. *Conjugate Priors* – Conjugacy occurs when the posterior distribution belongs to the same family of probability distributions as the prior, but with new parameter values which have been updated to reflect the new evidence/information. Examples are Beta-Binomial, Gamma-Poisson and Normal-Normal.

ii. *Non-conjugate Priors* – It is also quite possible that our personal belief cannot be expressed in terms of a suitable conjugate prior. In those cases simulation tools are applied to approximate the posterior distribution. One example of such a tool is the Gibbs sampler.

iii. *Uninformative Priors* – Another approach is to minimize the amount of information that goes into the prior in order to reduce bias. This is an attempt to let the data have maximum influence on the posterior. These priors are known as uninformative priors, and in these cases the results might be pretty similar to those of the frequentist approach.
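To make the idea of conjugacy concrete, here is a minimal sketch of the Beta-Binomial case for a coin, where a Beta(α, β) prior combined with observed heads and tails gives a Beta(α + heads, β + tails) posterior. The function names, the two priors and the data (7 heads, 3 tails) are hypothetical and chosen only for illustration.

```python
def beta_binomial_update(alpha, beta, heads, tails):
    """Return the Beta posterior parameters after observing coin flips."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical data, for illustration only: 7 heads and 3 tails.
heads, tails = 7, 3

priors = {
    "flat Beta(1, 1)": (1, 1),             # a common uninformative choice
    "informative Beta(10, 10)": (10, 10),  # hypothetical belief centred on 0.5
}

for name, (a, b) in priors.items():
    post_a, post_b = beta_binomial_update(a, b, heads, tails)
    print(name, "-> Beta(%d, %d)," % (post_a, post_b),
          "posterior mean", round(beta_mean(post_a, post_b), 3))
```

Under the flat prior the posterior mean stays close to the observed share of heads (0.7), whereas the informative prior pulls it towards 0.5, which is the sense in which an uninformative prior lets the data dominate.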

• *Likelihood* – Suppose θ is the unknown parameter that we are trying to estimate. Let’s represent the fairness of a coin with θ, taken here as the probability that a flip comes up ‘head’. To check the fairness, we flip the coin repeatedly; each flip comes up either ‘head’ or ‘tail’, and we assign it a value of 1 or 0 accordingly. These are known as Bernoulli trials. We are interested in the probability of all the outcomes $X_1, \dots, X_n$ taking the observed values $x_1, \dots, x_n$ given a value of theta. We’re viewing each of these outcomes as independent and hence, we can write this in product notation. This is the probability of observing the actual data that we collected (head or tail), conditioned on a value of the parameter theta (fairness of coin), and can be expressed as follows:

$$P(X_1 = x_1, \dots, X_n = x_n \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}$$

This is the concept of likelihood, which is the density function thought of as a function of theta. To maximize the likelihood, i.e., to make the data we have observed as probable as possible, we choose the theta that gives the largest value of the likelihood. This is referred to as the maximum likelihood estimate or MLE. Additionally, a quick reminder: the generalization of the Bernoulli to N repeated and independent trials is the binomial. We will see the application later in the article.
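As a rough illustration of maximum likelihood for Bernoulli trials, the sketch below evaluates the likelihood on a grid of candidate θ values and picks the largest. The flips, the grid resolution and the function name `bernoulli_likelihood` are assumptions made for this example; θ is taken to be the probability of a head.

```python
def bernoulli_likelihood(theta, flips):
    """Product of theta**x * (1 - theta)**(1 - x) over independent flips."""
    likelihood = 1.0
    for x in flips:
        likelihood *= theta if x == 1 else 1.0 - theta
    return likelihood

# Hypothetical data, for illustration only: 1 = head, 0 = tail.
flips = [1, 0, 0, 1, 0]

# Evaluate the likelihood at theta = 0.00, 0.01, ..., 1.00 and keep the best.
grid = [i / 100 for i in range(101)]
theta_hat = max(grid, key=lambda t: bernoulli_likelihood(t, flips))

# For Bernoulli data the MLE coincides with the sample share of heads.
print("grid MLE:", theta_hat, "| share of heads:", sum(flips) / len(flips))
```

A grid search is used here only because it mirrors the definition directly; for Bernoulli data the maximum can also be found analytically as the sample proportion of heads.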

• *Posterior Distribution* – This is the result or output of Bayes’ Theorem. A posterior probability is the revised or updated probability of an event occurring after taking new information into consideration. We calculate the posterior probability p(θ|X), i.e., how probable our hypothesis about θ is given the observed evidence.

## Mechanism of Bayesian Inference:

The Bayesian approach treats probability as a degree of belief about a certain event given the available evidence. In Bayesian learning, theta is assumed to be a random variable. Let’s understand the Bayesian inference mechanism a little better with an example.

Inference example using Frequentist vs Bayesian approach: Suppose my friend challenged me to take part in a bet where I need to predict whether a particular coin is fair or not. She told me, “Well, this coin turned up ‘Head’ 70% of the time when I flipped it several times. Now I am giving you a chance to flip the coin 5 times and then you have to place your bet.” I flipped the coin 5 times; head came up twice and tail came up thrice. At first, I thought like a frequentist.

So, θ is an unknown parameter which is a representation of fairness of the coin and can be defined as

**θ = {fair, loaded}**

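Additionally, I assumed that the outcome variable X (the number of heads in n flips) follows a Binomial distribution with the following functional representation, where p is the probability of a head on a single flip:

$$P(X = x \mid p) = \binom{n}{x}\, p^{x} (1 - p)^{n - x}$$

Now in our case **n=5**. A fair coin has p = 0.5 and, going by my friend’s claim, the loaded coin has p = 0.7.

Now my likelihood function, i.e., the probability of the data viewed as a function of θ, will be

$$f(\theta \mid X) = \begin{cases} \binom{5}{X} (0.5)^{5} & \text{if } \theta = \text{fair} \\ \binom{5}{X} (0.7)^{X} (0.3)^{5 - X} & \text{if } \theta = \text{loaded} \end{cases}$$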

Now, I saw that head came up twice, so my X = 2.

$$f(\theta \mid X = 2) = \begin{cases} 0.31 & \text{if } \theta = \text{fair} \\ 0.13 & \text{if } \theta = \text{loaded} \end{cases}$$

Therefore, using the frequentist approach I can conclude that the maximum likelihood estimate (MLE) is $\hat{\theta}$ = fair, since ‘fair’ gives the larger likelihood.
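The two likelihood values can be checked with a few lines of Python. This is a sketch under the assumptions used in this example: a fair coin shows a head with probability 0.5, and the loaded coin with probability 0.7 as my friend claimed. The names `binomial_pmf` and `P_HEAD` are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of x heads in n independent flips with head probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Assumed head probabilities for the two candidate values of theta.
P_HEAD = {"fair": 0.5, "loaded": 0.7}

# Likelihood of each theta given X = 2 heads in n = 5 flips.
likelihood = {theta: binomial_pmf(2, 5, p) for theta, p in P_HEAD.items()}

# The MLE is the value of theta with the largest likelihood.
mle = max(likelihood, key=likelihood.get)

for theta, value in likelihood.items():
    print(theta, round(value, 2))
print("MLE:", mle)
```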

Now comes the tricky part. If someone asks how sure I am about my prediction, I will not be able to answer properly, because in a frequentist world a coin is a physical object and hence the probability can only be 0 or 1, i.e., the coin is either fair or not.

Here comes the role of Bayesian inference, which will tell us the uncertainty in my prediction, i.e., P(θ|X=2). Bayesian inference allows us to incorporate our knowledge/information about the unknown parameter θ even before looking at any data. Here, suppose I know my friend pretty well and I can say with 90% probability that she has given me a loaded coin.

Therefore, my prior is P(loaded)=0.9 and, correspondingly, P(fair)=0.1. I can now update my prior belief with data and get the posterior probability using Bayes’ Theorem.

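The numerator is the likelihood multiplied by the prior. Using head probabilities of 0.5 for the fair coin and 0.7 for the loaded one, my numerator calculation will be as follows:

$$P(X \mid \theta)\,P(\theta) = \begin{cases} \binom{5}{X} (0.5)^{5} (0.1) & \text{if } \theta = \text{fair} \\ \binom{5}{X} (0.7)^{X} (0.3)^{5 - X} (0.9) & \text{if } \theta = \text{loaded} \end{cases}$$

The denominator is a constant and can be calculated as the expression below. Please note that here we are basically summing the numerator over all possible values of θ, of which there are only 2 in this case, i.e., fair or loaded.

$$\sum_{\theta} P(X \mid \theta)\,P(\theta) = \binom{5}{X} (0.5)^{5} (0.1) + \binom{5}{X} (0.7)^{X} (0.3)^{5 - X} (0.9)$$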

Hence, after replacing X with 2 we can calculate the Bayesian probability of the coin being loaded or fair. Do it yourself and let me know your answer! You will find that this conclusion gives more information for placing a bet than the frequentist approach does.
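For readers who would like to check their answer, the posterior calculation can be sketched as follows. It assumes the same setup as the example: head probabilities of 0.5 (fair) and 0.7 (loaded), and priors of 0.1 and 0.9. The names `binomial_pmf`, `P_HEAD`, `PRIOR` and `posterior` are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of x heads in n independent flips with head probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Assumptions taken from the example: head probability and prior per theta.
P_HEAD = {"fair": 0.5, "loaded": 0.7}
PRIOR = {"fair": 0.1, "loaded": 0.9}

def posterior(x, n=5):
    """Posterior P(theta | X = x) over the two candidate values of theta."""
    # Numerator: likelihood times prior, one term per value of theta.
    numerators = {t: binomial_pmf(x, n, P_HEAD[t]) * PRIOR[t] for t in PRIOR}
    # Denominator: the same quantity summed over all values of theta.
    evidence = sum(numerators.values())
    return {t: value / evidence for t, value in numerators.items()}

post = posterior(2)
for theta, prob in post.items():
    print(theta, round(prob, 3))
```

Running it prints the two posterior probabilities, which sum to one. Changing the entries of `PRIOR` shows how strongly the conclusion depends on the prior when there are only five flips.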

**Application of Bayesian Inference in financial risk modeling:**

Bayesian inference has found application in various widely used algorithms, e.g., regression, Random Forest, neural networks, etc. Apart from that, it has also gained popularity in the operational risk modelling of several banks. A bank’s operational loss data typically shows some loss events with low frequency but high severity. For these low-frequency cases, Bayesian inference turns out to be useful as it does not require a lot of data.

Earlier, Frequentist methods were used for operational risk models, but because of their inability to make inferences about parameter uncertainty, Bayesian inference came to be considered more informative: it can combine expert opinion with actual data to derive the posterior distributions of the severity and frequency distribution parameters. Generally, for this type of statistical modeling, the bank’s internal loss data is divided into several buckets, and the loss frequencies of each bucket are determined by expert judgment and then fitted to probability distributions.

Hello! I am Ananya. I have a degree in Economics and I have been working as a financial risk analyst for the last 5 years. I am also a voracious reader of Data Science Blogs just as you are. This is my first article for Analytics Vidhya. Hope you found this article useful.

https://www.linkedin.com/in/ananya-bhattacharyya-3b4b0685/

