Understanding PCA Analysis Using Scikit-Learn in Python

Principal Component Analysis (PCA) is a technique used for dimensionality reduction in machine learning. It is a method that transforms data from a high-dimensional space into a lower-dimensional space by finding the axes (principal components) that explain the maximum variance in the data.

In this tutorial, we will use the scikit-learn library in Python to perform PCA analysis. Scikit-learn is a powerful machine learning library that provides tools for data preprocessing, model selection, and evaluation.

Step 1: Import the necessary libraries
First, you need to import the required libraries for performing PCA analysis in Python. You will need NumPy for numerical operations, pandas for data manipulation, and scikit-learn for PCA analysis.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

Step 2: Load the dataset
Next, you need to load the dataset that you want to perform PCA analysis on. For this tutorial, we will use the famous Iris dataset, which contains information about the sepal and petal lengths and widths of three different species of iris flowers.

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
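
As a quick sanity check (optional), you can inspect what was loaded using the standard attributes of the load_iris bunch object: the feature matrix has 150 samples and 4 features.

# Inspect the loaded data: 150 samples, 4 features
print(X.shape)             # (150, 4)
print(iris.feature_names)  # sepal/petal length and width (cm)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']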

Step 3: Standardize the data
Before applying PCA, it is important to standardize the data so that each feature has a mean of 0 and a standard deviation of 1. This will prevent features with larger scales from dominating the principal components.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
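
If you want to verify the scaling, each standardized column should now have a mean close to 0 and a standard deviation close to 1:

# Each column should have mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))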

Step 4: Perform PCA analysis
Now we can perform PCA analysis on the standardized data. The number of principal components to retain can be specified through the n_components parameter. By default, PCA retains all principal components.

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
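
Because no n_components value was passed, all four components are kept, so the transformed array has the same shape as the input. A quick check, using the variables from the previous steps:

# With the default settings, all 4 components are retained
print(pca.n_components_)  # 4
print(X_pca.shape)        # (150, 4)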

Step 5: Explained variance ratio
After performing PCA, you can access the explained variance ratio of each principal component. The explained variance ratio tells you what fraction of the total variance in the data each principal component accounts for.

explained_variance_ratio = pca.explained_variance_ratio_
print('Explained variance ratio:', explained_variance_ratio)
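
For a more readable summary, you could also print each component's share as a percentage (a small formatting sketch using the array computed above):

# Print each component's contribution as a percentage of the total variance
for i, ratio in enumerate(explained_variance_ratio, start=1):
    print(f'PC{i}: {ratio * 100:.2f}% of variance')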

Step 6: Selecting the number of components
To decide how many principal components to retain, you can plot the cumulative explained variance ratio as a function of the number of components and pick the smallest number that captures a sufficiently large share of the variance (commonly around 95%).

import matplotlib.pyplot as plt
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()
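
If you already know the fraction of variance you want to keep, scikit-learn can choose the number of components for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance exceeds that threshold. For example, to keep roughly 95% of the variance:

# Let PCA pick the number of components needed to explain ~95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print(pca_95.n_components_)  # number of components retained
print(X_pca_95.shape)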

Step 7: Visualize the principal components
Lastly, you can visualize the data in the reduced-dimensional space spanned by the principal components. The scatter plot below uses seaborn, which needs to be imported first.

# Create a DataFrame containing the principal components
df_pca = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2', 'PC3', 'PC4'])
df_pca['species'] = iris.target_names[iris.target]

# Plot the first two principal components
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='species', data=df_pca)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Analysis of Iris Dataset')
plt.legend()
plt.show()
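
Beyond the scatter plot, you can also inspect the component loadings stored in pca.components_ to see how strongly each original feature contributes to each principal component. A minimal sketch, assuming the pca object fitted in Step 4:

# Rows are principal components, columns are the original features
loadings = pd.DataFrame(
    pca.components_,
    columns=iris.feature_names,
    index=['PC1', 'PC2', 'PC3', 'PC4'],
)
print(loadings)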

In this tutorial, we have covered the basics of performing PCA analysis in Python using the scikit-learn library. PCA is a powerful technique for dimensionality reduction and visualization of high-dimensional data. By following these steps, you can apply PCA to your own datasets and gain insights into the underlying structure of your data.
