Logistic regression is an algorithm that is used when the dependent variable is qualitative. A qualitative variable takes one of a fixed number of values. An example could be transport type such as: air, road, rail or ship (4 levels). Algorithms that predict a qualitative variable are called classification algorithms.
Logistic regression is usually used to predict dichotomous variables. These are variables with two levels such as: true or false. However, logistic regression can be extended with a process called ‘one versus all classification’ to predict variables with more than two levels.
Logistic versus linear regression
Linear (or Ordinary Least Squares) regression is used to predict a quantitative variable. Linear regression could be used to predict a two-level categorical variable if it is coded as 1 or 0. However, there are two main problems associated with this approach.
Firstly, a linear function is unbounded, which means that in some cases the function will predict values greater than 1 or less than 0. Secondly, the coefficients of a linear function are calculated to minimise the sum of squared differences between the observations of the dependent variable and the regression line. This may be appropriate for a quantitative variable, but the goal of a classification technique is to minimise the number of misclassifications.
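The first problem can be illustrated with a small sketch. The data here is made up (the vectors `x` and `y` are hypothetical and not taken from the heart disease example); the point is only that a least squares fit to a 0/1 outcome can return values outside the 0 to 1 range.

```r
# Hypothetical toy data: a 0/1 outcome that becomes more likely as x grows
x <- 1:10
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)

# Ordinary least squares fit to the 0/1 outcome
fit <- lm(y ~ x)

# Predictions at the ends of the observed range and beyond it
new_x <- data.frame(x = c(-10, 1, 10, 20))
preds <- predict(fit, newdata = new_x)
print(round(preds, 2))

# Some predictions fall below 0 or above 1,
# so they cannot be read as probabilities
any(preds < 0)
any(preds > 1)
```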
The sigmoid function
The sigmoid function is guaranteed to produce an output between 0 and 1 – it asymptotes to these values as the function parameter approaches negative or positive infinity. The form of the function is shown below.
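In the usual notation, the function can be written as follows. The symbols are assumptions chosen to match the βj and Xj used later in this post: z stands for the value of the regression equation and p(x) for the predicted probability.

```latex
p(x) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k
```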
The equation above can be transformed into the following.
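One way to write the transformed equation is sketched here, assuming p(x) denotes the sigmoid output 1 / (1 + e^{-z}) and z denotes the regression equation with coefficients β and features X (this notation is an assumption). Since 1 - p(x) = e^{-z} / (1 + e^{-z}), the ratio p(x) / (1 - p(x)) equals e^{z}, and taking logs of both sides gives:

```latex
\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k = \log\left(\frac{p(x)}{1 - p(x)}\right)
```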
The expression on the right is the log of the odds, which is why logistic regression is sometimes called logit (from ‘log odds’) regression. Odds are used in betting. For example, if a horse is quoted at 5 to 1, it could be expected to lose five times and win once in six races, so its probability of winning is 1/6.
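The betting example can be worked through in a few lines of R. This is a minimal sketch: the variable names are illustrative, while `qlogis()` and `plogis()` are the base R logit and sigmoid functions.

```r
# A horse quoted at "5 to 1" is expected to lose 5 times for each win
losses <- 5
wins <- 1

p_win <- wins / (wins + losses)   # probability of winning: 1/6
odds_win <- p_win / (1 - p_win)   # odds in favour of winning: 1/5

# The logit is the natural log of the odds;
# qlogis() computes it directly from the probability
log_odds <- log(odds_win)
all.equal(log_odds, qlogis(p_win))

# plogis() is the sigmoid, so it maps the log odds back to the probability
all.equal(plogis(log_odds), p_win)
```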
Interpreting logistic regression’s coefficients
As can be seen in the diagram above, as the value of the regression equation increases (as measured on the x axis), the associated probability also increases. This means that larger positive regression coefficients increase the resulting probability. However, unlike linear regression, a one unit increase in Xj is not associated with a βj increase in Y. Instead, it changes the log odds by βj, and the resulting change in probability depends on the starting point, because the slope of the logistic curve is not constant and flattens as p(x) approaches 0 or 1.
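A short sketch makes this concrete. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, not estimates from any data set.

```r
# Hypothetical coefficients for a one-feature model: log odds = b0 + b1 * x
b0 <- -3
b1 <- 1

x <- c(0, 1, 3, 4)
p <- plogis(b0 + b1 * x)   # sigmoid of the regression equation

# The change in probability for a one-unit increase in x
# depends on where the increase starts
p[2] - p[1]   # from x = 0 to x = 1, in the flat lower tail
p[4] - p[3]   # from x = 3 to x = 4, starting at p = 0.5 where the curve is steepest

# The odds, however, are multiplied by the same factor exp(b1) each time
odds <- p / (1 - p)
odds[2] / odds[1]
odds[4] / odds[3]
exp(b1)
```

The same one-unit step gives a much larger change in probability near p = 0.5 than in the tail, while the multiplicative change in the odds stays constant.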
Interpreting logistic regression’s output
As shown above, logistic regression will output a probability ranging from 0 to 1. A specific probability will be associated with a specific set of independent variable values. But as the dependent variable takes only two values, each probability needs to be converted into a two-level discrete value. The threshold value that determines whether a probability is assigned to true or false is typically set to 0.5. The R code for this conversion is shown below.
# Predicted probabilities for four observations
probs <- c(0.1, 0.3, 0.6, 0.8)
# Assign TRUE when the probability exceeds the 0.5 threshold
ifelse(probs > 0.5, TRUE, FALSE)
# FALSE FALSE TRUE TRUE
A worked example
As an example, a logistic regression was used to predict the incidence of heart disease within a region in South Africa. The data was obtained from here. The data consists of 462 observations with 9 features. The response variable indicates whether the person had heart disease.
Using k-fold cross-validation
In order to evaluate the model, the data set was partitioned into training data, used to fit the model, and test data, used to evaluate it. This process was repeated so that 5 different training / test sets were generated. This process is called k-fold cross-validation (where in this case k = 5). This is shown diagrammatically below.
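The partitioning can be sketched in base R as follows. This is not the code used for the post: the data frame `d` is simulated stand-in data with a single hypothetical feature `x`, and only the number of observations (462) and k = 5 are taken from the text.

```r
set.seed(1)  # for a reproducible fold assignment

# Simulated stand-in data (not the heart disease data)
n <- 462
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(-0.6 + d$x))

# Randomly assign each observation to one of k folds of near-equal size
k <- 5
folds <- sample(rep(1:k, length.out = n))

accuracy <- numeric(k)
for (i in 1:k) {
  train <- d[folds != i, ]   # k - 1 folds used to fit the model
  test  <- d[folds == i, ]   # the held-out fold used to evaluate it

  fit <- glm(y ~ x, family = binomial, data = train)
  p <- predict(fit, newdata = test, type = "response")

  # Share of held-out observations classified correctly at a 0.5 threshold
  accuracy[i] <- mean((p > 0.5) == (test$y == 1))
}

accuracy        # one accuracy estimate per fold
mean(accuracy)  # cross-validated accuracy
```

Each observation appears in exactly one test set, so every row is used for evaluation once and for fitting k - 1 times.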
Evaluating unbalanced classes
Classification algorithms may be used to predict the occurrence of relatively rare events. For example, the heart data used in this example has 35% of observations with some type of heart disease and 65% without heart disease. So using the dominant class (i.e. no heart disease) to predict the outcome would result in an accuracy of 65%. But this approach would not be useful for identifying people with a high likelihood of heart disease.
The following table shows confusion matrices for thresholds of 0.5 and 0.25. The values on the top-left to bottom-right diagonal are the correctly predicted values. For a threshold of 0.50, these equal (250 + 81) 331 observations or about 72%. This is better than the naïve prediction of 65%, but this overall statistic does not show how accurately the model predicts heart disease.
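The accuracy figures can be checked with a short sketch that uses the counts quoted in this section for a threshold of 0.5. The number of missed cases is not quoted directly; it is derived here as 160 - 81, and the layout of the matrix (actual classes in rows, predicted classes in columns) is an assumption.

```r
# Counts for a threshold of 0.5, as quoted in this section
tp <- 81         # heart disease, predicted heart disease
tn <- 250        # no heart disease, predicted no heart disease
fp <- 52         # no heart disease, predicted heart disease
fn <- 160 - tp   # heart disease, predicted no heart disease (derived)

# matrix() fills column by column: first the "predicted no" column,
# then the "predicted yes" column
confusion <- matrix(c(tn, fn, fp, tp), nrow = 2,
                    dimnames = list(actual = c("no", "yes"),
                                    predicted = c("no", "yes")))
confusion

# Accuracy: correctly predicted observations on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

# Naive baseline: always predict the dominant class (no heart disease)
baseline <- (tn + fp) / sum(confusion)

round(c(accuracy = accuracy, baseline = baseline), 2)
```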
The number of correctly predicted cases of heart disease is 81. But 160 people had heart disease. So, given that a person had heart disease, there is a (81 / 160) 51% chance that they will be correctly identified. This measure is called ‘recall’.
If the threshold value is decreased, a greater number of people will be predicted to have heart disease and, as a consequence, recall will increase. For example, if the threshold is decreased to 0.25, then 129 people will be correctly predicted to have heart disease and the value of recall will increase to (129 / 160) 81%.
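The recall figures follow directly from the counts quoted above. A minimal sketch, with illustrative variable names:

```r
# People who actually had heart disease
actual_positive <- 160

# Correctly identified cases at each threshold
tp <- c("0.50" = 81, "0.25" = 129)

# Recall: share of actual cases that the model identifies
recall <- tp / actual_positive
round(recall, 2)
```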
Trading off recall and precision
But the increase in recall comes at a cost. In the table above, with a threshold of 0.5, there were (52 + 81) 133 people predicted with heart disease, but only 81 or (81 / 133) 61% actually had heart disease. This percentage is called ‘precision’.
When the threshold is decreased to 0.25, the number of people predicted with heart disease increases to 266. This causes precision to decrease to (129 / 266) 48%.
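The precision figures can be reproduced in the same way from the counts quoted in the text. Again a minimal sketch with illustrative variable names:

```r
# Correctly identified cases at each threshold
tp <- c("0.50" = 81, "0.25" = 129)

# Everyone predicted to have heart disease at each threshold
predicted_positive <- c("0.50" = 52 + 81, "0.25" = 266)

# Precision: share of predicted cases that actually had heart disease
precision <- tp / predicted_positive
round(precision, 2)
```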
What threshold value to choose?
The graph above shows the trade-off between recall and precision. If the costs of incorrectly identifying heart disease are small, then it may be appropriate to lower the threshold value to increase recall. Precision will decrease, but given the low cost of incorrectly predicting the presence of heart disease, this may be considered an acceptable trade-off.
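A trade-off curve of this kind can be sketched by computing recall and precision over a range of thresholds. The example below is an illustration on simulated data with a single hypothetical feature, evaluated on the same data the model was fitted to, so its numbers will not match the heart disease results.

```r
set.seed(1)

# Simulated stand-in data (not the heart disease data)
n <- 462
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.6 + x))

fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)   # predicted probabilities for the fitted observations

# Recall and precision at each threshold
thresholds <- seq(0.1, 0.9, by = 0.1)
metrics <- t(sapply(thresholds, function(th) {
  pred <- p > th
  tp <- sum(pred & y == 1)
  c(threshold = th,
    recall = tp / sum(y == 1),
    # precision is undefined when nobody is predicted positive
    precision = if (any(pred)) tp / sum(pred) else NA)
}))

round(metrics, 2)
```

Recall can only fall or stay the same as the threshold rises, because fewer observations are predicted positive. Precision usually moves in the opposite direction, although it is not guaranteed to rise at every step.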
Get the code from GitHub
You can get all the code for this post from this link.