Machine learning is a powerful tool that can be used to make predictions and decisions based on data. In this tutorial, we will cover the steps to build a machine learning project from scratch using Python and the Scikit-learn library.
Step 1: Installing Python and Scikit-learn
First, you will need to have Python installed on your computer. You can download Python from the official website and follow the installation instructions. Once you have Python installed, you can install Scikit-learn by using pip, the Python package manager. Simply run the following command in your terminal or command prompt:
pip install -U scikit-learn
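To verify that the installation succeeded, you can print the installed version from your terminal:
python -c "import sklearn; print(sklearn.__version__)"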
Step 2: Understanding the Dataset
For this tutorial, we will use the Iris dataset, a popular starting point for machine learning beginners. It contains 150 samples from three species of iris flowers (setosa, versicolor, and virginica), each described by four measurements: sepal length, sepal width, petal length, and petal width.
You can load the Iris dataset using the following code snippet:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
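Here, X is a 150x4 array of measurements and y holds the species label (0, 1, or 2) for each flower. To get a feel for the data, you can inspect its shape and the class names:
print(X.shape)            # (150, 4): 150 flowers, 4 measurements each
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']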
Step 3: Preprocessing the Data
Before building a machine learning model, it is important to preprocess the data so that it is in the right format. In this step, we will split the data into training and testing sets and then standardize the features.
First, split the data using the train_test_split function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Next, standardize the features with the StandardScaler class, which rescales each feature to zero mean and unit variance. Note that the scaler is fitted on the training set only and then applied to the test set; fitting it on the full dataset would leak information about the test set into preprocessing:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
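As a quick sanity check, the standardized training features should now have approximately zero mean and unit variance (this check is just for illustration and is not required for the rest of the tutorial):
import numpy as np
print(np.round(X_train.mean(axis=0), 2))  # roughly [0. 0. 0. 0.]
print(np.round(X_train.std(axis=0), 2))   # roughly [1. 1. 1. 1.]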
Step 4: Building and Training the Model
Now that the data is preprocessed and split into training and testing sets, we can build a machine learning model. In this tutorial, we will use a simple classification algorithm called k-Nearest Neighbors (KNN), which classifies a new sample by taking a majority vote among the labels of its k closest training samples.
To build and train a KNN model, you can use the KNeighborsClassifier class from Scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
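The choice of n_neighbors=3 above is somewhat arbitrary. As a rough sketch of how you might tune it, you can compare a few values of k with 5-fold cross-validation on the training set (the candidate values here are just examples):
from sklearn.model_selection import cross_val_score
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"k={k}: mean cross-validation accuracy = {scores.mean():.3f}")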
Step 5: Evaluating the Model
Once the model is trained, we can evaluate its performance on the testing set. We can use metrics such as accuracy, precision, recall, and F1 score to assess the model’s performance.
To evaluate the model, you can use the following code snippet:
from sklearn.metrics import accuracy_score, classification_report
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # renamed so it does not shadow the imported function
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n{report}")
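Beyond these aggregate scores, a confusion matrix shows which species get mistaken for which. This is an optional extra step:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))  # rows are true classes, columns are predicted classes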
Step 6: Making Predictions
Finally, you can use the trained model to make predictions on new data. Remember to transform the new data with the same scaler that was fitted on the training set before passing it to the model's predict method:
new_data = [[5.1, 3.5, 1.4, 0.2]]  # sepal length, sepal width, petal length, petal width
new_data_scaled = scaler.transform(new_data)
prediction = knn.predict(new_data_scaled)
print(f"Prediction: {iris.target_names[prediction][0]}")
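If you also want class probabilities, KNeighborsClassifier provides a predict_proba method; for KNN this is simply the fraction of the k neighbors that belong to each class:
probabilities = knn.predict_proba(new_data_scaled)
print(dict(zip(iris.target_names, probabilities[0])))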
That’s it! You have successfully built a machine learning project from scratch using Python and Scikit-learn. Feel free to experiment with different algorithms, datasets, and preprocessing techniques to expand your machine learning skills. Happy coding!
We hope you enjoyed the tutorial! We run a 6-month online data science bootcamp where participants learn practical skills, build real-world projects, and get 1:1 mentorship to land their first data science job. Learn more and apply here: https://zerotodatascience.com