Introduction to Classification using Scikit-Learn: ML02 Supervised Learning

In this tutorial, we will discuss supervised learning with Scikit-Learn, focusing specifically on classification. Supervised learning is a type of machine learning in which an algorithm learns from labeled training data in order to make predictions on unseen data. Classification is a type of supervised learning where the goal is to predict the category a new data point belongs to, based on the features present in the data.

Scikit-Learn is a powerful Python library that provides a wide range of machine learning algorithms and tools for data processing and model evaluation. In this tutorial, we will be using Scikit-Learn to build and evaluate classification models on a sample dataset.

To get started, make sure you have Scikit-Learn installed. You can install it using the following command:

pip install scikit-learn

Now we can start by importing the necessary libraries and loading the dataset. For this tutorial, we will use the famous Iris dataset, which contains measurements of three different species of iris flowers.

Import the necessary libraries:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Load the dataset:

iris = load_iris()
X = iris.data
y = iris.target

As shown in the code snippet above, we import the necessary libraries: NumPy for numerical operations, Pandas for data manipulation, and the required modules from Scikit-Learn. We then load the Iris dataset using the load_iris function and assign the features to X and the target variable to y.
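
As a quick optional check, the object returned by load_iris also exposes the feature and class names, which makes it easy to see what the arrays actually contain:

print(X.shape)             # (150, 4): 150 samples, 4 numeric features
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']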

Next, we will split the data into training and testing sets using the train_test_split function. This will allow us to train our model on a portion of the data and evaluate its performance on unseen data.

Split the data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In this snippet, we split the data into training and testing sets, reserving 20% of the samples for testing and using a random seed of 42 for reproducibility.
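
If you want the class proportions in the training and testing sets to mirror those in the full dataset, train_test_split also accepts a stratify argument. A minimal variant of the call above:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # preserve class balance in both splits
)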

Now we can proceed to train a classification model on the training data. In this tutorial, we will use a Random Forest Classifier, a popular ensemble learning algorithm that combines the predictions of many decision trees.

Train a Random Forest Classifier:

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
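
RandomForestClassifier works well with its defaults, but two constructor arguments are worth knowing about early on: n_estimators (the number of trees in the forest, 100 by default) and random_state (which fixes the randomness so results are reproducible). A more explicit version of the snippet above might look like this:

clf = RandomForestClassifier(
    n_estimators=100,  # number of decision trees in the ensemble
    random_state=42    # fix the randomness for reproducible results
)
clf.fit(X_train, y_train)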

After training the model, we can make predictions on the testing data and evaluate its performance using accuracy as a metric.

Make predictions and evaluate the model:

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
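
Accuracy is a single number; for a per-class breakdown you can optionally use classification_report and confusion_matrix from sklearn.metrics, which report precision, recall, and F1-score for each species:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))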

Finally, we can use the model to make predictions on new, unseen data.

Make predictions on new data:

new_data = np.array([[5.1, 3.5, 1.4, 0.2]])  # sepal length, sepal width, petal length, petal width (cm)
prediction = clf.predict(new_data)
print(f'Predicted class: {iris.target_names[prediction[0]]}')
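
If you also want the model's confidence rather than just a label, Random Forests expose predict_proba, which returns an estimated probability for each class:

probabilities = clf.predict_proba(new_data)[0]
for name, p in zip(iris.target_names, probabilities):
    print(f'{name}: {p:.2f}')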

In this tutorial, we covered the basics of supervised learning with Scikit-Learn, focusing on classification. We loaded the Iris dataset, split it into training and testing sets, trained a Random Forest Classifier, evaluated its performance, and made predictions on new data.

This is just a starting point, and there’s much more to learn about machine learning and Scikit-Learn. I encourage you to explore different algorithms, metrics, and datasets to gain a deeper understanding of the field. Happy learning!
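
As a hint of where to go next, swapping in a different algorithm usually means changing a single line, and cross_val_score gives a more robust performance estimate than a single train/test split. A small sketch, using a k-nearest neighbours classifier as an example:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5)  # 5-fold cross-validation on the full dataset
print(f'Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})')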