ML10: Supervised Learning With Scikit-Learn – Logistic regression and the ROC curve
In this module, we will discuss supervised learning with logistic regression in Scikit-Learn. Logistic regression is a popular classification algorithm, most often used to predict binary outcomes.
Logistic Regression
Logistic regression is a statistical model that predicts the probability of a binary outcome. It passes a linear combination of the input features through the logistic (sigmoid) function, which maps any real number to a value between 0 and 1, and interprets the result as the probability that the input belongs to the positive class. In this module, we will learn how to implement logistic regression using the Scikit-Learn library in Python.
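To make the role of the logistic function concrete, here is a minimal sketch in plain Python. The weights, bias, and feature values are made up for illustration and do not come from a fitted model; the helper names sigmoid and predict_probability are likewise hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    # Linear combination of the features, then squashed by the sigmoid.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.5]
bias = 0.1

p = predict_probability([2.0, 1.0], weights, bias)
label = int(p >= 0.5)  # classify with the conventional 0.5 threshold
print(f"probability = {p:.3f}, predicted class = {label}")
```

Here the linear score is 0.1 + 0.8 * 2.0 - 0.5 * 1.0 = 1.2, which the sigmoid turns into a probability above 0.5, so the sample is assigned to the positive class.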
ROC Curve
The ROC curve, short for Receiver Operating Characteristic curve, is a graphical representation of the performance of a binary classification model. It plots the true positive rate (sensitivity) against the false positive rate (1 – specificity) as the classification threshold is varied. The area under the ROC curve (AUC) summarizes the curve in a single number and is a common metric for evaluating a classification model: an AUC of 0.5 corresponds to random guessing, and an AUC of 1.0 corresponds to perfect separation of the two classes.
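The construction of the curve can be sketched by hand on a tiny example. The code below is an illustrative plain-Python version, not how Scikit-Learn implements it; the function names roc_points and trapezoid_auc and the four labels and scores are assumptions made for this example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc_value = trapezoid_auc(points)
print(points)
print(f"AUC = {auc_value:.2f}")
```

Each threshold yields one point on the curve; lowering the threshold moves the point up and to the right, trading more true positives for more false positives.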
Implementing Logistic Regression with Scikit-Learn
To implement logistic regression with Scikit-Learn, you will first need to import the necessary libraries:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc
    import matplotlib.pyplot as plt
Next, you can load your dataset and split it into training and testing sets using the train_test_split function:
    # Load the dataset
    data = pd.read_csv('your_dataset.csv')

    # Split the data into features and target variable
    X = data.drop('target', axis=1)
    y = data['target']

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Finally, you can fit a logistic regression model to your training data and generate the ROC curve:
    # Fit a logistic regression model
    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    # Generate predicted probabilities for the test data
    y_pred_prob = lr.predict_proba(X_test)[:,1]

    # Generate the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    roc_auc = auc(fpr, tpr)

    # Plot the ROC curve
    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic curve')
    plt.legend(loc="lower right")
    plt.show()
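If you do not have a CSV file at hand, the same workflow can be tried on synthetic data. The sketch below assumes a generated dataset from make_classification in place of 'your_dataset.csv'; the sample count, feature count, stratify=y, and max_iter=1000 are illustrative choices, not requirements. It skips the plot and instead checks the AUC against roc_auc_score, which computes the same quantity directly from the labels and scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for a real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# stratify=y keeps the class balance similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A higher max_iter than the default gives the solver more room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class.
y_pred_prob = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)
direct_auc = roc_auc_score(y_test, y_pred_prob)

print(f"AUC from roc_curve + auc: {roc_auc:.3f}")
print(f"AUC from roc_auc_score:   {direct_auc:.3f}")
```

The two values should agree; roc_auc_score is simply the shorter route when you need the number but not the curve itself.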
By following these steps, you will be able to implement logistic regression with Scikit-Learn and visualize the performance of your model using the ROC curve.