# The Theory of Logistic Regression and a Python Implementation from Scratch

by Mathematics Enthusiast · Sep 2023

When learning new ideas in applied mathematics and machine learning, I personally need a good and deep understanding of the theory and all its intricacies in order to fully grasp and master the subject. One of the first concepts I encountered when studying statistics and machine learning was Logistic Regression. There are many resources that explain the model fairly well and give practical uses and implementations. However, I often struggle to find resources that present the theory and intuition that led to the concepts we now know. Hence the motivation behind this blog. I will first introduce the general setting before diving into the details of the model.

Classification is the attempt to find a rule that predicts labels given inputs:
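In symbols (the original post renders its equations as images; this is a standard reconstruction, writing $\mathcal{X}$ for the input space and $\mathcal{Y}$ for the finite label set):

```latex
f : \mathcal{X} \longrightarrow \mathcal{Y}, \qquad x \mapsto f(x)
```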

Now, we could be trying to learn in a setting where the labels are deterministic functions of the inputs, but a more realistic and general representation is one where both inputs and labels follow an unknown joint probability distribution.

A classification algorithm attempts to infer a classification rule, or classifier, from a set of i.i.d. training data. Ideally, we want our classifier to generalise well when given out-of-sample data.

The error can be generalised to all kinds of criteria, or loss functions:
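For a loss function $L$, the generalisation error (or risk) of a classifier $f$ is usually written as (a standard formulation, since the original equation image is not reproduced here):

```latex
R(f) = \mathbb{E}_{(X, Y) \sim \mathbb{P}} \left[ L\big(f(X), Y\big) \right]
```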

The classification rule that performs best with respect to a specific loss function is called the Bayes rule, and is as follows:
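In standard notation (reconstructed, since the original shows this as an image), the Bayes rule minimises the conditional expected loss pointwise:

```latex
f^{*}(x) = \underset{\hat{y} \in \mathcal{Y}}{\arg\min} \; \mathbb{E}\left[ L(\hat{y}, Y) \mid X = x \right]
```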

Proof of optimality:

By definition, the risk is the expectation of the loss. Conditioning on the input, it follows that the conditional expected loss of any classifier is at least that of the Bayes rule, at every input. Thus, taking expectations, the Bayes rule achieves the lowest possible risk.
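The steps of the proof can be reconstructed as follows (standard derivation, using the tower property of conditional expectation):

```latex
% By definition:
R(f) = \mathbb{E}\left[ L(f(X), Y) \right]
     = \mathbb{E}_{X}\Big[ \, \mathbb{E}\left[ L(f(X), Y) \mid X \right] \Big].

% It follows that, for every classifier f and every input x:
\mathbb{E}\left[ L(f(x), Y) \mid X = x \right]
  \;\ge\; \min_{\hat{y} \in \mathcal{Y}} \mathbb{E}\left[ L(\hat{y}, Y) \mid X = x \right]
  \;=\; \mathbb{E}\left[ L(f^{*}(x), Y) \mid X = x \right].

% Thus, taking expectations over X:
R(f) \;\ge\; R(f^{*}).
```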

In the case of the 0–1 loss, which measures the probability of a misclassification, the Bayes classifier can be written as:
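Under the 0–1 loss $L(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$, the conditional expected loss is $1 - \mathbb{P}(Y = \hat{y} \mid X = x)$, so the Bayes classifier becomes (standard reconstruction):

```latex
f^{*}(x) = \underset{y \in \mathcal{Y}}{\arg\max} \; \mathbb{P}(Y = y \mid X = x)
```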

Which, if inspected carefully, turns out to be a fairly intuitive result: we minimise the probability of misclassification by taking the most likely label given the input.

Unfortunately, the Bayes classifier is intractable unless we have direct access to the underlying probability distribution.

Without loss of generality, we will from now on work in a binary classification setting, where the cardinality of the label set is 2. For convenience, we will assume that the labels are 0 and 1.

In Logistic Regression, we attempt to approximate the Bayes rule within a class of linear classifiers. We do this by introducing the following assumption on the joint probability distribution of inputs and labels:
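The logistic regression assumption, in the usual notation with weights $w \in \mathbb{R}^{d}$ and bias $b \in \mathbb{R}$ (my reconstruction of the original equation image):

```latex
\mathbb{P}(Y = 1 \mid X = x) = \sigma\!\left(w^{\top} x + b\right) = \frac{1}{1 + e^{-(w^{\top} x + b)}}
```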

The previous assumption is equivalent to saying:
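Namely, that the log-odds of the two labels are a linear function of the input (a standard identity obtained by inverting the sigmoid):

```latex
\log \frac{\mathbb{P}(Y = 1 \mid X = x)}{\mathbb{P}(Y = 0 \mid X = x)} = w^{\top} x + b
```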

All data points in the upper half-space of the linear decision boundary will be classified as 1, and all other points as 0, since:
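Concretely, in the same reconstructed notation, the plug-in classifier and the equivalence it relies on are:

```latex
\hat{f}(x) = \begin{cases} 1 & \text{if } w^{\top} x + b \ge 0 \\ 0 & \text{otherwise} \end{cases},
\qquad \text{since} \qquad
\sigma\!\left(w^{\top} x + b\right) \ge \tfrac{1}{2} \iff w^{\top} x + b \ge 0.
```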

The hat over the classifier symbolises that it is the output of our algorithm.

The sigmoid function is a very convenient way to introduce a linear decision boundary: modelling the conditional probability above directly with a linear function could lead to contradictions, since it could take values outside of the range [0, 1].
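As a quick sketch of this squashing behaviour (a minimal NumPy helper, not the article's original snippet):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number (or array of them) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Whatever the linear score w.x + b is, the output is a valid probability.
print(sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0))
```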

But how do we choose this linear boundary?

Many learning algorithms rely on empirical risk minimisation, where we approximate the generalisation error by the empirical mean of the loss criterion over the training set, and choose a classifier that minimises this empirical risk:
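For a training set $(x_1, y_1), \dots, (x_n, y_n)$, the ERM rule reads (standard reconstruction of the missing equation):

```latex
\hat{f} = \underset{f \in \mathcal{H}}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)
```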

The hypothesis class is the class of classifiers we restrict our search to; in our case:
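That is, the linear classifiers described above (reconstructed in the same notation):

```latex
\mathcal{H} = \left\{ x \mapsto \mathbb{1}\!\left[ w^{\top} x + b \ge 0 \right] \;:\; w \in \mathbb{R}^{d}, \; b \in \mathbb{R} \right\}
```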

In this case, the ERM strategy is highly impractical: there is no closed-form solution, and the indicator function is not at all suitable for gradient descent and other numerical optimisation methods.

Maximum Likelihood:

The likelihood of a set of observations (in this case, the training set) in a parametric model like the one above is the probability of observing such a set given an underlying parameter. Using the i.i.d. nature of the training set, we find:
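Plugging in the logistic model for each observation, the likelihood factorises as (standard reconstruction):

```latex
\mathcal{L}(w, b) = \prod_{i=1}^{n} \mathbb{P}(Y = y_i \mid X = x_i)
                  = \prod_{i=1}^{n} \sigma\!\left(w^{\top} x_i + b\right)^{y_i} \left( 1 - \sigma\!\left(w^{\top} x_i + b\right) \right)^{1 - y_i}
```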

As the name suggests, we pick the parameter that maximises the probability of observing the training data. For numerical convenience, it is preferable to work with the log-likelihood:
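Taking the logarithm of the product above turns it into a sum:

```latex
\ell(w, b) = \sum_{i=1}^{n} \Big[ y_i \log \sigma\!\left(w^{\top} x_i + b\right) + (1 - y_i) \log\!\left( 1 - \sigma\!\left(w^{\top} x_i + b\right) \right) \Big]
```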

The negative of the last expression is also well known as the cross-entropy loss.

Gradient descent:

I will not go too in-depth on the intricacies of gradient descent, since it is not the subject of this blog, but briefly: it is an iterative method that attempts to find the minimum of a function by taking small steps in the direction opposite to that of the function's gradient.
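The update rule can be reconstructed as (for an objective $F$ and iterates $\theta_t$):

```latex
\theta_{t+1} = \theta_{t} - \gamma \, \nabla F(\theta_{t})
```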

where the parameter gamma is called the learning rate, and controls the magnitude of the step taken at each iteration.

We are looking for:
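That is, the parameters that minimise the negative log-likelihood (equivalently, maximise the log-likelihood):

```latex
(\hat{w}, \hat{b}) = \underset{w \in \mathbb{R}^{d}, \; b \in \mathbb{R}}{\arg\min} \; \big( -\ell(w, b) \big)
```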

The gradient of this objective function is:
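Differentiating the negative log-likelihood term by term (using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$) gives the standard expressions:

```latex
\nabla_{w} \big( -\ell(w, b) \big) = \sum_{i=1}^{n} \left( \sigma\!\left(w^{\top} x_i + b\right) - y_i \right) x_i,
\qquad
\frac{\partial}{\partial b} \big( -\ell(w, b) \big) = \sum_{i=1}^{n} \left( \sigma\!\left(w^{\top} x_i + b\right) - y_i \right)
```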

Now let's code this up!
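A from-scratch sketch consistent with the derivation above (my reconstruction, since the original snippet is not reproduced here; the class and parameter names are illustrative):

```python
import numpy as np


class LogisticRegressionScratch:
    """Binary logistic regression trained by full-batch gradient descent."""

    def __init__(self, learning_rate=0.1, n_iters=1000):
        self.learning_rate = learning_rate  # the gamma of the update rule
        self.n_iters = n_iters
        self.w = None
        self.b = 0.0

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_iters):
            # p_i = sigma(w . x_i + b), the modelled P(Y = 1 | X = x_i)
            p = self._sigmoid(X @ self.w + self.b)
            # gradients of the negative log-likelihood, averaged over samples
            grad_w = X.T @ (p - y) / n_samples
            grad_b = np.mean(p - y)
            # gradient descent step
            self.w -= self.learning_rate * grad_w
            self.b -= self.learning_rate * grad_b
        return self

    def predict_proba(self, X):
        """Estimated P(Y = 1 | X = x) for each row of X."""
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        """Plug-in classifier: label 1 iff the estimated probability is >= 1/2."""
        return (self.predict_proba(X) >= 0.5).astype(int)
```

Averaging the gradient over the samples (rather than summing) only rescales the learning rate, but makes the same step size usable across dataset sizes. On a well-separated toy dataset, a few thousand iterations are enough for the decision boundary to recover nearly all the labels.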