3.7- Data Partitioning in Scikit-learn

In machine learning, data splitting is a crucial step in the model-building process. It involves dividing a dataset into separate training and testing sets so that the model can be evaluated on data it did not see during training. This gives an estimate of how well the model generalizes and helps reveal overfitting to the training data.

Scikit-learn, a popular machine learning library in Python, provides a simple and efficient way to split data using its train_test_split function. In this tutorial, we will walk through the process of splitting data using scikit-learn and discuss some best practices.

  1. Importing the necessary libraries:

    import numpy as np
    from sklearn.model_selection import train_test_split
  2. Loading the dataset:
    Before splitting the data, you need to load your dataset. For demonstration purposes, let’s use the built-in Iris dataset in scikit-learn.

    from sklearn.datasets import load_iris
    iris = load_iris()
    X = iris.data
    y = iris.target
  3. Splitting the data:
    Now, we can split the data into training and testing sets using the train_test_split function. The function takes the feature matrix (X) and target vector (y) as inputs, along with the test size and random state.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

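    In this example, we split the data into 80% training and 20% testing sets; for the 150-sample Iris dataset that is 120 training rows and 30 testing rows. The random_state parameter seeds the shuffling applied before the split, so the same rows end up in each set every time the code runs.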
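The effect of random_state can be checked directly. The sketch below is illustrative rather than part of the original walkthrough: it splits the Iris data twice with the same seed and compares the results, then shows the optional stratify argument of train_test_split, which keeps class proportions equal across the subsets. The variable names (same_split, strat_counts) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same random_state: the same rows land in the test set on every call.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Identical test sets:", same_split)

# stratify=y preserves the class proportions of y in both subsets.
# Iris has 50 samples per class, so a 30-row test set gets 10 of each.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
strat_counts = np.bincount(y_test_strat)
print("Test-set class counts with stratify=y:", strat_counts)
```

Stratifying is mostly useful when classes are imbalanced or the dataset is small, where a purely random split can leave a class under-represented in the test set.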

  4. Checking the shapes of the datasets:
    It’s always a good practice to verify the shapes of the training and testing sets to ensure that the split was successful.

    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)
    print("y_train shape:", y_train.shape)
    print("y_test shape:", y_test.shape)
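Beyond printing, the same check can be written as assertions so that a bad split fails loudly. This is a minimal sketch, assuming the Iris data and the 80/20 split used above; the specific checks are a suggestion, not something the tutorial requires.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Rows must add up to the original dataset.
assert X_train.shape[0] + X_test.shape[0] == X.shape[0]
# Features and labels must stay aligned within each subset.
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]
# The number of feature columns must be unchanged by the split.
assert X_train.shape[1] == X_test.shape[1] == X.shape[1]

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```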
  5. Training a model:
    Once the data is split, you can proceed with model training using the training set (X_train and y_train). For example, let’s train a simple logistic regression model.

    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(max_iter=200)  # raise the iteration cap in case the default is too low for the solver to converge
    model.fit(X_train, y_train)
  6. Evaluating the model:
    After training the model, you can evaluate its performance on the testing set (X_test and y_test) to assess its generalization ability.

    accuracy = model.score(X_test, y_test)
    print("Model accuracy:", accuracy)
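For a classifier, model.score returns mean accuracy, so it should agree with accuracy_score from sklearn.metrics applied to the model's predictions. The sketch below checks that, and also prints the training accuracy next to the test accuracy, since a large gap between the two is a common sign of overfitting. The max_iter=1000 setting is an assumption made here to give the solver room to converge; it is not taken from the original code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_iter=1000 is an assumption, chosen to give the solver room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# For classifiers, score() is mean accuracy, so these two values match.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
score_accuracy = model.score(X_test, y_test)

# A training score far above the test score would hint at overfitting.
train_accuracy = model.score(X_train, y_train)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```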
  7. Conclusion:
    Data splitting is an essential step in machine learning for estimating a model’s performance on unseen data. Scikit-learn provides a convenient way to split data using the train_test_split function. By following this tutorial, you should now have a solid understanding of how to split data in scikit-learn, train a model on the training set, and evaluate it on the testing set.
4 Comments
@ZHA2065
1 month ago

Thank you for your generosity and your explanation ❤

@user-vf7ud4tb2h
1 month ago

May God reward you with goodness

@AbdullahNajeh-c7f
1 month ago

Your explanation is legendary, God bless you

@salehabbas5072
1 month ago

On to the world stage with Engineer Ahmed.. honestly a world-class explanation 🥰