Logistic Regression


Logistic regression is an algorithm used when the dependent variable is qualitative. A qualitative variable takes a fixed number of values, called levels. An example is transport type: train, road, rail or ship (4 levels). Algorithms that predict a qualitative variable are called classification algorithms.

Sigmoid function

Unlike a linear function, the sigmoid function is bounded between 0 and 1. Also unlike a linear function, its slope is not constant. Therefore, care needs to be taken when interpreting logistic regression’s coefficients.

Logistic regression is usually used to predict dichotomous variables: variables with two levels, such as true or false. However, logistic regression can be extended with a process called ‘one versus all’ classification to predict variables with more than two levels.
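As a sketch of one-versus-all classification, the snippet below fits one binary logistic regression per level and assigns each observation to the level with the highest predicted probability. The data and variable names are simulated and purely illustrative.

```r
# One-versus-all classification for a 3-level response (simulated data)
set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$class <- factor(sample(c("train", "road", "ship"), n, replace = TRUE))

classes <- levels(d$class)

# Fit one binary model per level: that level versus all the others
fits <- lapply(classes, function(cl)
  glm(I(class == cl) ~ x1 + x2, data = d, family = binomial))

# Each column holds one model's predicted probabilities
probs <- sapply(fits, predict, newdata = d, type = "response")

# Assign each observation to the level with the highest probability
pred <- classes[max.col(probs)]
```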

Logistic versus linear regression

Linear (or ordinary least squares) regression is used to predict a quantitative variable. Linear regression could be used to predict a categorical variable if it is coded as 1 or 0; however, there are two main problems associated with this approach.

Firstly, a linear function is unbounded, which means that in certain cases the function will predict values greater than 1 or less than 0. Secondly, the slope of a linear function is calculated to minimise the sum of squared differences between the dependent variable observations and the regression line. This may be appropriate for a quantitative variable, but the goal of a classification technique is to minimise the number of misclassifications.
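The first problem can be demonstrated with a short simulation; the data below is illustrative.

```r
# A straight line fitted to a 0/1 outcome can predict values
# outside the [0, 1] interval (simulated data)
set.seed(1)
x <- seq(-3, 3, length.out = 50)
y <- ifelse(x + rnorm(50) > 0, 1, 0)

linear_fit <- lm(y ~ x)
preds <- predict(linear_fit, newdata = data.frame(x = c(-5, 5)))
preds   # at least one prediction lies outside [0, 1]
```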

The Sigmoid function

The sigmoid function is guaranteed to produce an output between 0 and 1 – it asymptotes to these values as its argument approaches negative or positive infinity. The form of the function is shown below.

p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

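The bounds can be checked directly in R; the sketch below uses the simplest case, with β0 = 0 and β1 = 1.

```r
# Sigmoid (inverse logit): maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)     # 0.5, the midpoint
sigmoid(10)    # close to 1
sigmoid(-10)   # close to 0
```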
The equation above can be transformed into the following.

β0 + β1x = log( p(x) / (1 − p(x)) )

The expression on the right is the log of the odds ratio; this is the reason why logistic regression is sometimes called logit (from ‘log odds’) regression. Odds ratios are used in betting. For example, if a horse is 5 to 1, then the horse could be expected to lose five times in six races and win once.
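The relationship between probability, odds and log-odds can be sketched in R; the 5-to-1 horse corresponds to a win probability of 1/6.

```r
# From probability to odds to log-odds, and back again
p <- 0.75
odds <- p / (1 - p)        # 3, i.e. 3 to 1 on
log_odds <- log(odds)      # the logit of p
plogis(log_odds)           # the inverse logit recovers 0.75

# The 5-to-1 horse: one win expected in six races
p_win <- 1 / 6
(1 - p_win) / p_win        # odds against winning: 5
```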


Interpreting logistic regression’s coefficients

As can be seen in the diagram above, as the value of the regression equation increases (as measured on the x axis), the associated probability also increases. This means that relatively larger regression coefficients will increase the resultant probability. However, unlike linear regression, a one-unit increase in Xj will not be associated with a βj increase in Y. This is because the slope of the logit curve is not constant and flattens for small or large values of p(x).
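Although a one-unit increase in Xj does not shift the probability by a constant amount, it does multiply the odds by e^βj. The sketch below uses purely illustrative coefficients.

```r
# Illustrative coefficients
b0 <- -2
b1 <- 0.8
p_x  <- function(x) plogis(b0 + b1 * x)      # predicted probability
odds <- function(x) p_x(x) / (1 - p_x(x))    # corresponding odds

# The change in probability per unit of x is not constant...
p_x(1) - p_x(0)
p_x(4) - p_x(3)

# ...but each unit of x multiplies the odds by the same factor, exp(b1)
odds(1) / odds(0)    # equals exp(0.8)
odds(4) / odds(3)    # equals exp(0.8)
```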

Interpreting logistic regression’s output

As shown above, logistic regression outputs a probability ranging from 0 to 1. A specific probability is associated with each set of independent variable values. But as the dependent variable takes only two values, each probability needs to be converted into a two-level discrete value. The threshold that determines whether a probability is assigned to true or false is typically set to 0.5. The R code for this conversion is shown below.

probs <- c(0.1, 0.3, 0.6, 0.8)
ifelse(probs > 0.5, TRUE, FALSE)

A worked example

As an example, logistic regression was used to predict the incidence of heart disease within a region in South Africa. The data was obtained from here. The data consists of 462 observations with 9 features. The response variable indicates whether the person had heart disease.
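The model can be fitted with R’s glm function. The sketch below uses a small simulated stand-in for the heart data, since the real column names are not reproduced here; `chd`, `age` and `ldl` are assumed, illustrative names.

```r
# Simulated stand-in for the heart disease data
set.seed(1)
heart <- data.frame(
  age = rnorm(100, 50, 10),
  ldl = rnorm(100, 5, 1)
)
heart$chd <- rbinom(100, 1, plogis(-10 + 0.2 * heart$age))

# family = binomial selects logistic regression
fit <- glm(chd ~ age + ldl, data = heart, family = binomial)

# Predicted probabilities of heart disease, one per observation
probs <- predict(fit, type = "response")
```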

Using k-fold cross-validation

In order to evaluate the model, the data set was partitioned into training data, used to fit the model, and test data, used to evaluate it. This process was repeated so that 5 different training / test sets were generated. This process is called k-fold cross-validation (where in this case k = 5). This is shown diagrammatically below.

K-fold cross-validation

K-fold cross-validation creates k different test partitions from the original data set. A model is fitted on each training set and the associated test statistic is calculated. The test statistic is then averaged over the k folds.
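A minimal version of this procedure can be sketched in R, again with simulated data standing in for the heart data set.

```r
# 5-fold cross-validation of a logistic regression (simulated data)
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x))

k <- 5
folds <- sample(rep(1:k, length.out = n))   # randomly assign rows to folds

accuracy <- numeric(k)
for (i in 1:k) {
  train <- d[folds != i, ]                  # fit on k - 1 folds...
  test  <- d[folds == i, ]                  # ...evaluate on the held-out fold
  fit   <- glm(y ~ x, data = train, family = binomial)
  probs <- predict(fit, newdata = test, type = "response")
  pred  <- ifelse(probs > 0.5, 1, 0)
  accuracy[i] <- mean(pred == test$y)
}

mean(accuracy)   # the test statistic averaged over the k folds
```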

Evaluating unbalanced classes

Classification algorithms may be used to predict the occurrence of relatively rare events. For example, the heart data used in this example has 35% of observations with some type of heart disease and 65% without. So using the dominant class (i.e. no heart disease) to predict the outcome would result in an accuracy of 65%. But this approach would not be useful for identifying people with a high likelihood of heart disease.

The following table shows a confusion matrix for thresholds of 0.50 and 0.25. The values on the top-left to bottom-right diagonal are the correctly predicted values. For a threshold of 0.50, these equal (250 + 81) 331 observations, or about 72%. This is better than the naïve prediction of 65%, but this general statistic does not consider how accurately the model predicts heart disease.

Threshold = 0.50         Predicted: no    Predicted: yes
Actual: no disease       250              52
Actual: heart disease    79               81

Threshold = 0.25         Predicted: no    Predicted: yes
Actual: no disease       165              137
Actual: heart disease    31               129
The number of correctly predicted incidences of heart disease is 81 cases. But 160 people had heart disease. So, given that a person had heart disease, there is a (81 / 160) 51% chance that they will be correctly identified. This measure is called ‘recall’.

If the threshold value is decreased, a greater number of people will be predicted to have heart disease and, as a consequence, recall will increase. For example, if the threshold is decreased to 0.25, then 129 people will be correctly predicted to have heart disease and recall will increase to (129 / 160) 81%.
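Recall can be computed directly from the confusion-matrix counts quoted above.

```r
# recall = true positives / (true positives + false negatives)
recall <- function(tp, fn) tp / (tp + fn)

recall(81, 79)    # threshold 0.50: 81 of 160 cases found, about 51%
recall(129, 31)   # threshold 0.25: 129 of 160, about 81%
```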

Trading off recall and precision

But the increase in recall comes at a cost. In the table above, with a threshold of 0.50, there were (52 + 81) 133 people predicted to have heart disease, but only 81, or (81 / 133) 61%, actually had heart disease. This percentage is called ‘precision’.

When the threshold is decreased to 0.25, the number of people predicted to have heart disease increases to 266. This causes precision to decrease to (129 / 266) 48%.
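Precision can be computed in the same way from the counts quoted above.

```r
# precision = true positives / (true positives + false positives)
precision <- function(tp, fp) tp / (tp + fp)

precision(81, 52)     # threshold 0.50: 81 / 133, about 61%
precision(129, 137)   # threshold 0.25: 129 / 266, about 48%
```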

Graph of precision versus recall

This graph shows total accuracy, recall and precision as the threshold (x axis) is varied. Total accuracy is maximised when the threshold is 0.5, but this does not maximise either precision or recall. The appropriate threshold value depends on the costs associated with false positives and false negatives.
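A curve like this can be generated by sweeping the threshold and recomputing each metric. The sketch below uses simulated probabilities and outcomes in place of the fitted model’s output.

```r
# Accuracy, recall and precision across a range of thresholds (simulated)
set.seed(1)
n <- 500
probs <- runif(n)            # stand-in for model probabilities
y <- rbinom(n, 1, probs)     # stand-in for observed outcomes

thresholds <- seq(0.05, 0.95, by = 0.05)
metrics <- t(sapply(thresholds, function(t) {
  pred <- probs > t
  c(accuracy  = mean(pred == (y == 1)),
    recall    = sum(pred & y == 1) / sum(y == 1),
    precision = sum(pred & y == 1) / sum(pred))
}))

# plot(thresholds, metrics[, "recall"], type = "l")   # one line per metric
```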

What threshold value to choose?

The graph above shows the trade-off between recall and precision. If the costs of incorrectly predicting heart disease are small, then it may be appropriate to lower the threshold value to increase recall. Precision will decrease, but given the low costs of incorrectly predicting the presence of heart disease, this may be considered an acceptable trade-off.

Get the code from GitHub

You can get all the code for this post from this link.