Making a Receiver Operating Characteristic (ROC) curve in R

infoart.ca · Oct 9, 2023
The ROC curve is a graphical representation that illustrates the performance of a binary classifier across different classification thresholds. It is widely used in machine learning and statistics to evaluate the predictive power of a model and determine the trade-off between sensitivity and specificity.

To demonstrate ROC curve analysis, we will use the mtcars dataset, which contains information about various car models. We will build a logistic regression model to predict whether a car has high or low fuel efficiency based on its characteristics, and we will use the pROC package to evaluate it.
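If the pROC package is not already installed, it can be installed from CRAN first:

# install pROC from CRAN (only needed once)
install.packages("pROC")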

library(pROC)

Let’s start by loading the mtcars dataset and exploring its structure:

data(mtcars)
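To get a quick sense of the variables available (for example, that disp and hp are numeric), one option is to inspect the structure:

# look at variable names and types
str(mtcars)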

The mtcars dataset contains various variables related to car specifications, including the outcome of interest, “mpg” (miles per gallon). For this analysis, let’s convert “mpg” into a binary outcome by categorizing cars with mpg greater than the median as high fuel efficiency and the rest as low fuel efficiency. We create a labelled version for the frequency table and a numeric 0/1 version (1 = high efficiency) for the regression model:

 
mtcars$mpg_cat <- ifelse(mtcars$mpg > median(mtcars$mpg), "High", "Low")
mtcars$mpg_binary <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)
table(mtcars$mpg_cat)

High  Low 
  15   17 

Building the Logistic Regression Model

We model the probability of high fuel efficiency (mpg_binary = 1) as a function of engine displacement (disp) and horsepower (hp):

model <- glm(mpg_binary ~ disp + hp, data = mtcars, family = "binomial")
summary(model)

Call:
glm(formula = mpg_binary ~ disp + hp, family = "binomial", data = mtcars)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.58261  -0.10067  -0.01016   0.29146   1.75376  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  8.96510    3.78501   2.369   0.0179 *
disp        -0.02833    0.01449  -1.955   0.0506 .
hp          -0.02685    0.02586  -1.038   0.2993  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 44.236 on 31 degrees of freedom
Residual deviance: 12.327 on 29 degrees of freedom
AIC: 18.327

Number of Fisher Scoring iterations: 7

Both disp and hp have negative coefficients, so higher engine displacement and higher horsepower are associated with lower odds of being in the high fuel efficiency group, although only disp approaches statistical significance (p ≈ 0.05) in this small sample.
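As a small addition to the original code, the coefficients can be put on the odds-ratio scale by exponentiating them:

# odds ratios for a one-unit increase in each predictor
exp(coef(model))

For example, exp(-0.02833) is about 0.972, so each extra cubic inch of displacement multiplies the odds of being in the high-efficiency group by roughly 0.97, holding horsepower constant.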

Now, let’s move on to evaluating the model’s performance using the ROC curve analysis. To generate the ROC curve and calculate the area under the curve (AUC), we will use the `roc` function from the `pROC` package.

predicted_probs <- predict(model, type = "response")
roc_obj <- roc(mtcars$mpg_binary, predicted_probs)

The roc function takes the true outcome variable (here mpg_binary, the same 0/1 variable used to fit the model) and the predicted probabilities from the logistic regression, and returns an object of class “roc” that represents the ROC curve.
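The AUC can also be read directly from this object (an extra line beyond the original code); it should match the value printed on the plot below:

# area under the ROC curve
auc(roc_obj)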

Now, to visualize the ROC curve, we can use the `plot` function:

plot(roc_obj, main = "ROC Curve", print.auc = TRUE, auc.polygon = TRUE, grid = TRUE)

The `plot` function draws the ROC curve; with print.auc = TRUE the AUC value is printed on the plot, and auc.polygon = TRUE shades the area under the curve.
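If you also want the threshold that best balances sensitivity and specificity (an optional extra step), the coords function in pROC can report it:

# "best" threshold by Youden's index, with its sensitivity and specificity
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))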

Interpreting the ROC Curve

The ROC curve provides a graphical representation of the trade-off between sensitivity and specificity. The closer the curve is to the top-left corner, the better the model’s performance. The area under the ROC curve (AUC) summarizes the model’s discriminative power: an AUC of 0.5 means the model does no better than random guessing, while an AUC of 1 means perfect discrimination. A higher AUC indicates better predictive performance.

Here the AUC is 0.980, which indicates that the model discriminates very well between cars with high and low fuel efficiency using the two predictors (engine displacement and horsepower) in the logistic regression.

In practical terms, an AUC of 0.980 means that if we pick two cars at random, one with high fuel efficiency and one with low fuel efficiency, the model will assign the higher predicted probability to the high-efficiency car about 98% of the time. This indicates a strong ability to distinguish between the two categories.
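With only 32 cars, the AUC estimate is fairly uncertain, so it can be worth adding a confidence interval (another optional step; pROC uses the DeLong method by default):

# 95% confidence interval for the AUC
ci.auc(roc_obj)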

Happy ROCking!
