Create a Machine Learning Project Using Python and Scikit-learn from the Ground Up

Machine learning is a powerful tool that can be used to make predictions and decisions based on data. In this tutorial, we will cover the steps to build a machine learning project from scratch using Python and the Scikit-learn library.

Step 1: Installing Python and Scikit-learn

First, you will need to have Python installed on your computer. You can download Python from the official website and follow the installation instructions. Once you have Python installed, you can install Scikit-learn by using pip, the Python package manager. Simply run the following command in your terminal or command prompt:

pip install -U scikit-learn
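
To confirm that the installation worked, one option is to import the library from Python and print its version. This is only a quick check, and the version printed will depend on your environment:

```python
# Import scikit-learn (the package is installed as "scikit-learn" but imported as "sklearn")
import sklearn

# Print the installed version to confirm the import succeeds
print(sklearn.__version__)
```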

Step 2: Understanding the Dataset

For this tutorial, we will use the Iris dataset, a popular dataset for machine learning beginners. It contains 150 samples from three species of iris flowers (setosa, versicolor, and virginica), each described by four measurements in centimeters: sepal length, sepal width, petal length, and petal width.

You can load the Iris dataset using the following code snippet:

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
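
Before going further, it can help to look at what was loaded. The sketch below prints the shape of the feature matrix, the feature and class names, and the number of samples per class. The use of NumPy's bincount is an addition for illustration and is not part of the original walkthrough:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

# Number of samples and number of features
print("Feature matrix shape:", X.shape)

# Names of the four measurements and the three species
print("Features:", iris.feature_names)
print("Classes:", list(iris.target_names))

# How many samples belong to each class (labels are 0, 1, 2)
class_counts = np.bincount(y)
print("Samples per class:", class_counts.tolist())
```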

Step 3: Preprocessing the Data

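Before building a machine learning model, it is important to preprocess the data so that it is in the right format. In this step, we will split the data into training and testing sets and then standardize the features. Splitting first lets the scaler learn its statistics from the training set only, so no information from the testing set leaks into training.

To split the data, use the train_test_split function. Here 20% of the samples are held out for testing, and random_state makes the split reproducible:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next, standardize each feature to zero mean and unit variance with the StandardScaler class from Scikit-learn. Fit the scaler on the training set, then apply the same transformation to the testing set:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and standard deviation from the training set
X_test = scaler.transform(X_test)  # reuse the training statistics on the testing set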
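
As a sanity check on this step, the sketch below splits and scales the data and then prints the resulting shapes and per-feature statistics. It fits the scaler on the training rows only, which is a common way to keep testing-set statistics out of training. The names X_train_scaled and X_test_scaled are chosen here for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train shape:", X_train_scaled.shape)
print("Test shape:", X_test_scaled.shape)

# After scaling, each training feature should have mean ~0 and standard deviation ~1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Training feature means:", np.round(train_means, 6))
print("Training feature std devs:", np.round(train_stds, 6))
```

The testing set will generally not have a mean of exactly 0 after scaling, because it is transformed with statistics learned from the training set.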

Step 4: Building and Training the Model

Now that the data is preprocessed and split into training and testing sets, we can build a machine learning model. In this tutorial, we will use a simple classification algorithm called the k-Nearest Neighbors (KNN) algorithm.

To build and train a KNN model, you can use the KNeighborsClassifier class from Scikit-learn:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
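
The value n_neighbors=3 is one reasonable choice, not the only one. A minimal sketch of comparing a few values with 5-fold cross-validation on the training set follows. The candidate list and the use of a Pipeline (so that scaling is refitted inside each fold) are assumptions made for this example rather than part of the walkthrough above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical candidate values for k, chosen only for illustration
candidate_ks = [1, 3, 5, 7, 9]

mean_scores = {}
for k in candidate_ks:
    # The pipeline scales the data and then classifies it, fold by fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores[k] = scores.mean()

for k, score in mean_scores.items():
    print(f"k={k}: mean cross-validated accuracy = {score:.3f}")

# Pick the k with the highest mean accuracy (ties go to the first such k)
best_k = max(mean_scores, key=mean_scores.get)
print(f"Best k among the candidates: {best_k}")
```

Because the testing set is not touched during this search, it remains available for an unbiased final evaluation.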

Step 5: Evaluating the Model

Once the model is trained, we can evaluate its performance on the testing set. We can use metrics such as accuracy, precision, recall, and F1 score to assess the model’s performance.

To evaluate the model, you can use the following code snippet:

from sklearn.metrics import accuracy_score, classification_report
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Store the report under a different name so the imported function is not overwritten
report = classification_report(y_test, y_pred, target_names=iris.target_names)

print(f"Accuracy: {accuracy:.3f}")
print(f"Classification Report:\n{report}")
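
Accuracy alone does not show which species get confused with each other. One way to see that is a confusion matrix, sketched below as a self-contained example. It rebuilds a comparable model with a Pipeline, which is a convenience assumed for this example and not the exact code used in the steps above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Scale and classify in one estimator
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

print("Class order:", list(iris.target_names))
print(cm)
```

Values on the diagonal count correct predictions. Any off-diagonal value counts samples of the row's species that were predicted as the column's species.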

Step 6: Making Predictions

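Finally, you can use the trained model to make predictions on new data. Scale the new sample with the same fitted scaler that was used for training, then pass it to the predict method of the model:

new_data = [[5.1, 3.5, 1.4, 0.2]]  # sepal length, sepal width, petal length, petal width (cm)
new_data_normalized = scaler.transform(new_data)
prediction = knn.predict(new_data_normalized)

# predict returns an array with one label per sample, so take the first element
print(f"Prediction: {iris.target_names[prediction[0]]}")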
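
If you also want a rough sense of how confident the model is, KNeighborsClassifier provides predict_proba, which by default reports the fraction of the k nearest neighbors belonging to each class. The sketch below is self-contained and reuses the sample values from above. The Pipeline is an assumption made here so that scaling is applied automatically:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# The pipeline applies the fitted scaler before every prediction
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

new_data = [[5.1, 3.5, 1.4, 0.2]]

# One row per sample, one column per class
probabilities = model.predict_proba(new_data)[0]
label = iris.target_names[model.predict(new_data)[0]]

for name, p in zip(iris.target_names, probabilities):
    print(f"{name}: {p:.2f}")
print(f"Predicted species: {label}")
```

With k=3 and the default uniform weighting, each probability is a multiple of one third, so treat these as coarse vote shares and not as calibrated probabilities.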

That’s it! You have successfully built a machine learning project from scratch using Python and Scikit-learn. Feel free to experiment with different algorithms, datasets, and preprocessing techniques to expand your machine learning skills. Happy coding!
