
The Bayes formula, as we know, is given by

p(wi | x) = p(x | wi) P(wi) / p(x)

Here, we assume the prior probabilities to be known; in practice they can be estimated from the data we have.

For a two-class problem, the prior probabilities are P(w1) and P(w2). The Bayes formula gives its output in the form of probabilities. If we are dealing with 'n' classes and the problem requires us to assign a sample to one of them, the formula above gives us the posterior probability of each class.

Let’s understand the significance of the equation.

p(wi | x) is known as the posterior (conditional) probability. Given a feature vector x, it yields n conditional probabilities, one for each class wi, where i = 1, 2, 3, ..., n.

p(x | wi) is the likelihood function: it tells us how likely it is to observe the feature vector x when the sample belongs to class wi, where i = 1, 2, 3, ..., n. p(x) is known as the evidence, the overall probability of observing x; it can be computed by summing p(x | wi) P(wi) over all classes.
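As a quick illustration, here is a minimal Python sketch that applies the formula to made-up numbers (the priors and likelihoods below are purely illustrative):

```python
# Minimal sketch of the Bayes formula with made-up numbers.
priors = {"w1": 0.6, "w2": 0.4}          # P(wi), assumed known or estimated from data
likelihoods = {"w1": 0.2, "w2": 0.5}     # p(x | wi) for one observed feature vector x

# Evidence p(x) = sum over classes of p(x | wi) * P(wi)
evidence = sum(likelihoods[w] * priors[w] for w in priors)

# Posterior p(wi | x) = p(x | wi) * P(wi) / p(x)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}

print(posteriors)                           # e.g. {'w1': 0.375, 'w2': 0.625}
print(max(posteriors, key=posteriors.get))  # class with the highest posterior
```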

Understanding why these methods are used is very important. Whenever we have data that is too complex to be processed at scale and no significant inferences can be drawn from it directly, probabilistic methods work very effectively.

The main goal of the process then becomes to model the data as a random process and analyse the uncertainty and the underlying patterns in the data.

A good example is tossing a coin N times. Here the probability model is

P(X = x), where x ∈ {0, 1},

with

P(X = 1) = p0 and P(X = 0) = 1 - p0, where 1 denotes heads and 0 denotes tails.

This is now a two-class problem, with w1 and w2 as the classes (n = 2). So to decide whether an outcome is heads or tails, we can calculate the conditional probabilities and make a decision.
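A small sketch of this coin model, assuming we simulate the tosses ourselves (the values of p0 and N are made up):

```python
import random

# Illustrative Bernoulli coin model: P(X = 1) = p0 (heads), P(X = 0) = 1 - p0 (tails).
p0 = 0.7        # assumed true probability of heads (made up)
N = 10_000      # number of tosses (made up)

tosses = [1 if random.random() < p0 else 0 for _ in range(N)]

# Estimate the class priors P(w1) = P(heads) and P(w2) = P(tails) from the data.
p_heads = sum(tosses) / N
p_tails = 1 - p_heads
print(f"estimated P(heads) = {p_heads:.3f}, P(tails) = {p_tails:.3f}")

# With no other features, the Bayes decision simply picks the more probable class.
decision = "heads" if p_heads > p_tails else "tails"
print("predicted outcome for the next toss:", decision)
```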

We can describe the Bayesian approach like this:

Posterior ∝ likelihood × prior
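In practice, this proportionality means the evidence p(x) can be ignored when all we need is a decision, since it is the same for every class. A tiny sketch with made-up numbers:

```python
# The evidence p(x) is common to all classes, so for a decision we can
# compare the unnormalised scores p(x | wi) * P(wi) directly.
priors = {"w1": 0.6, "w2": 0.4}        # made-up priors
likelihoods = {"w1": 0.2, "w2": 0.5}   # made-up likelihoods for one sample x

scores = {w: likelihoods[w] * priors[w] for w in priors}  # posterior up to a constant
print(max(scores, key=scores.get))     # same winner as with the full posteriors
```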

Another concept in Bayesian decision theory that needs to be understood is the Gaussian distribution. Most of the functions we apply to data assume that the data is Gaussian. But what do we actually mean by Gaussian?

The Gaussian distribution is also known as the normal distribution. It is given by the equation:

p(x) = (1 / sqrt(2π σ²)) · exp( -(x - μ)² / (2σ²) )

The equation above is for a single real-valued variable and is governed by two parameters: the mean μ and the variance σ².
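A minimal sketch of this density in Python, with made-up values for the mean and variance:

```python
import math

def gaussian_pdf(x: float, mu: float, var: float) -> float:
    """Density of a single real-valued Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Example: evaluate the density at a few points (mu and var are made up).
mu, var = 0.0, 1.0
for x in (-2.0, 0.0, 2.0):
    print(f"p({x:+.1f}) = {gaussian_pdf(x, mu, var):.4f}")
```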

However, whenever we work with probabilities there is a chance that we do not get the desired output. This is called error, and it can be defined as the difference between the desired output and the actual output. For a two-class problem we can describe it as the probability of picking the wrong class, which for a given x is the smaller of the two posteriors:

P(error | x) = min[ p(w1 | x), p(w2 | x) ]

Our aim is to minimize the error so that our accuracy is high. We cannot ignore this, for example when such models are used in the health-care domain: a high error can cost a person's life. Similarly, in business even a small change in output accuracy can have a huge impact, as companies work with millions of dollars. That is why the error factor, or the risk, must be minimized.

An intuitive way of understanding the error is as the overlap between the class distributions: the more they overlap, the larger the error. When we say that we are minimizing the error, we are actually increasing the separation between the two distributions.
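To make the overlap idea concrete, here is a rough numerical sketch that approximates the error for two made-up Gaussian class-conditional densities with equal priors: the error is the area where the smaller of the two weighted densities lies, i.e. the region of overlap.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Two made-up classes with equal priors and unit-variance Gaussian likelihoods.
prior = 0.5
mu1, mu2, var = 0.0, 2.0, 1.0

# Numerically integrate P(error) = integral of min[p(x|w1)P(w1), p(x|w2)P(w2)] dx.
dx = 0.001
xs = [i * dx for i in range(-10_000, 12_000)]
p_error = sum(min(gaussian_pdf(x, mu1, var), gaussian_pdf(x, mu2, var)) * prior * dx
              for x in xs)
print(f"approximate error with overlapping classes: {p_error:.3f}")

# Pushing the means further apart (less overlap) shrinks the error.
mu2 = 4.0
p_error = sum(min(gaussian_pdf(x, mu1, var), gaussian_pdf(x, mu2, var)) * prior * dx
              for x in xs)
print(f"approximate error with more separation:     {p_error:.3f}")
```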

So if we have M classes, there will be M distributions partitioning the feature space into M regions. It can happen that two regions are contiguous, and in such situations they are separated by what are called decision surfaces. From a mathematical point of view, it is sometimes easier not to work with the probabilities directly, since they become complex as the number of features increases. We then turn to equivalent functions called discriminant functions, gi(x). The decision test now becomes: assign x to class wi if gi(x) > gj(x) for all j ≠ i.

Discriminant functions are most useful when the distributions are well behaved, that is, Gaussian. If the Gaussians also share the same covariance, the discriminant functions become linear, and the classifier is both optimal and computationally inexpensive.
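As an illustration, here is a small sketch of log-posterior discriminant functions gi(x) = ln p(x | wi) + ln P(wi) for 1-D Gaussian classes with equal variance (all numbers are made up); the class with the largest gi(x) wins.

```python
import math

# Made-up 1-D example: two Gaussian classes with equal (shared) variance.
classes = {
    "w1": {"prior": 0.6, "mu": 0.0},
    "w2": {"prior": 0.4, "mu": 2.0},
}
var = 1.0  # shared variance

def discriminant(x, prior, mu, var):
    """g_i(x) = ln p(x | wi) + ln P(wi) for a 1-D Gaussian likelihood."""
    log_likelihood = -((x - mu) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)
    return log_likelihood + math.log(prior)

x = 1.2  # a sample to classify (made up)
scores = {name: discriminant(x, c["prior"], c["mu"], var) for name, c in classes.items()}
print(scores)
print("decide:", max(scores, key=scores.get))  # assign x to the class with the largest g_i(x)
```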

To summarize:

Bayes' theorem allows you to calculate conditional probabilities called posterior probabilities.

The Bayes classifier finds the most probable output class using the training data.

Minimizing the error, or the risk, can be done in different ways depending on the effect the error has on the output.

