Data Preparation & Model Training in Machine Learning

Introduction

Ready to dive into machine learning? Let’s build a simple ML model from scratch using Python! In this step-by-step guide, we’ll train a classifier to recognize different types of iris flowers using the famous Iris dataset.

This dataset includes measurements like sepal length, sepal width, petal length, and petal width for three species: Setosa, Versicolor, and Virginica. Our goal? Train a model that can predict a flower’s species based on these measurements.

We’ll walk through the entire process in three parts:

Part 1: Loading and exploring the dataset & Preparing the data for training
Part 2: Training a classifier with scikit-learn
Part 3: Evaluating the model’s performance & Making predictions on new data

If some of these steps sound unfamiliar, don’t worry! We’ll break everything down along the way. By the end of this guide, you’ll have built and run your first machine learning project while understanding the essential ML workflow.

By the end of this part, you’ll be ready for Model Training.

Before You Start

Make sure your Python environment is set up with libraries like pandas, matplotlib, and scikit-learn (refer to Getting Started with Python for Machine Learning for setup details).
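
As a quick sanity check, the short sketch below imports each of the three libraries and prints its installed version. The `versions` dictionary is only an illustrative name; if any import fails, that library still needs to be installed.

```python
# Import each library the guide relies on and report its installed version.
import matplotlib
import pandas as pd
import sklearn

versions = {
    "pandas": pd.__version__,
    "matplotlib": matplotlib.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```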

1. Step 1: Select and Understand the Dataset

For this beginner-friendly project, we’re using the Iris dataset, a small and well-known dataset often considered the “Hello World” of machine learning.

It contains measurements for three species of iris flowers:
🌸 Setosa
🌿 Versicolor
🌼 Virginica

Each flower is described by four measurements:

Sepal length
Sepal width
Petal length
Petal width

1.1 How Our Model Will Learn

The idea behind our model is simple:
Each species tends to have its own pattern of measurements, and our model will learn to classify a flower based on these differences.

For example:

Setosa → Short petals, short sepals
Versicolor → Medium-sized petals, medium-length sepals
Virginica → Long petals, long sepals

If we look at petal length vs. petal width, we can see that each species clusters in a different area.

Let’s first look at the end result: a scatter plot showing how the species group together based on their petal measurements:

🔴 Setosa (Red): Clearly separated with shorter and narrower petals
🟢 Versicolor (Green): Medium-sized petals, overlapping slightly with Virginica
🔵 Virginica (Blue): Has longer and wider petals
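
To make the idea of "learning from differences" concrete, here is a minimal sketch of a hand-written rule that classifies flowers from petal measurements alone. The two cut-off values (2.5 cm and 1.75 cm) are assumptions chosen by eye for illustration; they are not the model trained in Part 2. A real classifier finds boundaries like these from the data instead of having them typed in.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]  # column 2: petal length (cm)
petal_width = iris.data[:, 3]   # column 3: petal width (cm)

# Hypothetical cut-offs picked by hand for illustration (not learned from data).
SETOSA_MAX_PETAL_LENGTH = 2.5
VERSICOLOR_MAX_PETAL_WIDTH = 1.75

# scikit-learn encodes the labels as 0 = setosa, 1 = versicolor, 2 = virginica.
predicted = np.where(
    petal_length < SETOSA_MAX_PETAL_LENGTH,
    0,
    np.where(petal_width < VERSICOLOR_MAX_PETAL_WIDTH, 1, 2),
)

# Fraction of flowers for which the rule matches the true label.
accuracy = (predicted == iris.target).mean()
print(f"Hand-written rule accuracy: {accuracy:.2%}")
```

Even this crude rule labels most flowers correctly, which is why petal measurements are such useful features here.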

1.2 Final Outcome of This Step

✅ We understand how the model will classify flowers based on measurements.

✅ We visualized how each species forms a pattern in measurement space.

✅ We see that petal length & width are key features for classification.

2. Step 2: Loading the Dataset

Now that we understand the structure of the dataset and how our model will learn, let’s bring the data into Python so we can start working with it.

Luckily, scikit-learn provides the Iris dataset for us, so we can load it directly—no need to download anything separately.

Here’s how we load the dataset using scikit-learn:

from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Print basic details
print("Feature names:", iris.feature_names)
print("Target classes:", iris.target_names)
print("First data point:", iris.data[0], "->", iris.target_names[iris.target[0]])

2.1 Understanding the Data Output

When we run this, we get:

Feature names
- ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target classes
- ['setosa' 'versicolor' 'virginica']
First data point
- [5.1 3.5 1.4 0.2] -> setosa

This tells us that the first flower in the dataset has relatively small petals and belongs to the Setosa species.
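
Beyond the first data point, it can help to check the overall size of the dataset. The sketch below prints the shape of the feature and label arrays and counts how many flowers belong to each species; `class_counts` is just an illustrative variable name.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# iris.data holds one row per flower and one column per measurement.
print("Data shape:", iris.data.shape)
# iris.target holds one numeric label per flower.
print("Target shape:", iris.target.shape)

# Count how many flowers carry each numeric label (0, 1, 2).
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count} samples")
```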

2.2 Visualizing the Data Structure

Now, let’s break down the dataset into a table format to see how it looks:

Sepal Length (cm)	Sepal Width (cm)	Petal Length (cm)	Petal Width (cm)	Species
5.1	3.5	1.4	0.2	Setosa
4.9	3.0	1.4	0.2	Setosa
7.0	3.2	4.7	1.4	Versicolor
6.3	3.3	6.0	2.5	Virginica

Each row represents a flower, with its measurements.
The Species column tells us which type of flower it is.
Setosa has much smaller petal measurements than Virginica, confirming what we saw earlier in the scatter plot.
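
The table above can be reproduced with a few lines of pandas. This is a minimal sketch that assumes the rows are stored grouped by species, 50 per species, so positions 0, 1, 50, and 100 give two Setosa flowers plus the first Versicolor and the first Virginica. The names `table` and `sample_rows` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a table with one column per measurement.
table = pd.DataFrame(iris.data, columns=iris.feature_names)
# Translate each numeric label (0, 1, 2) into its species name.
table["species"] = iris.target_names[iris.target]

# Pick two Setosa rows, the first Versicolor row, and the first Virginica row.
sample_rows = table.iloc[[0, 1, 50, 100]]
print(sample_rows.to_string(index=False))
```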

2.3 Final Outcome of This Step

✅ We successfully loaded the dataset into Python.
✅ We understand how the data is structured (features & labels).
✅ We see that each species has a characteristic range of measurements.

Now that our dataset is loaded, let’s explore it further to find patterns that can help our model learn.

3. Step 3: Exploratory Data Analysis (EDA)

Before training our model, let’s take a deeper look at the dataset. This step, called Exploratory Data Analysis (EDA), helps us understand patterns, relationships, and potential challenges in our data.

3.1 Checking the Dataset Structure

First, let’s inspect the dataset to see how many samples and features we have. We’ll also convert it into a pandas DataFrame for easier exploration.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Convert to a DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target  # Add species labels

# Map numeric species labels to actual names
df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Display dataset shape and first few rows
print("Dataset shape:", df.shape)
print(df.head())

# Show summary statistics (print() is needed when running this as a script)
print(df.describe())

Key Outputs:

Dataset shape: (150, 5) → 150 flowers, each with 4 features + species label
df.head() → Displays the first five rows
df.describe() → Shows summary statistics
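
Two other checks are common at this stage: looking for missing values and confirming that the classes are balanced. The sketch below rebuilds the same DataFrame so it can run on its own; `missing_per_column` and `species_counts` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Count missing values in each column; all zeros means nothing to clean up.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Count how many flowers belong to each species.
species_counts = df["species"].value_counts()
print(species_counts)
```

Equal counts per species mean the model will not be biased toward a more frequent class.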

3.2 Visualizing the Data: Petal Length vs. Petal Width

Since we are trying to classify species based on measurements, let’s plot a scatter plot showing the relationship between petal length and petal width.

# Scatter plot: Petal length vs. Petal width
colors = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}
for species, color in colors.items():
    subset = df[df['species'] == species]
    plt.scatter(subset['petal length (cm)'], subset['petal width (cm)'], 
                c=color, label=species)

plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.legend()
plt.title('Iris Species by Petal Measurements')
plt.show()
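
What the plot shows visually can also be checked numerically. As a rough sketch, the code below computes the smallest and largest petal measurements per species and tests whether the petal-length ranges overlap. The variable names (`petal_ranges`, `setosa_is_separate`, `others_overlap`) are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Minimum and maximum petal measurements for each species.
petal_columns = ["petal length (cm)", "petal width (cm)"]
petal_ranges = df.groupby("species")[petal_columns].agg(["min", "max"])
print(petal_ranges)

length = petal_ranges["petal length (cm)"]
# Setosa is cleanly separated if its longest petal is shorter than
# the shortest Versicolor petal.
setosa_is_separate = length.loc["setosa", "max"] < length.loc["versicolor", "min"]
# Versicolor and Virginica overlap if the longest Versicolor petal is
# longer than the shortest Virginica petal.
others_overlap = length.loc["versicolor", "max"] > length.loc["virginica", "min"]

print("Setosa separated by petal length:", bool(setosa_is_separate))
print("Versicolor and Virginica overlap:", bool(others_overlap))
```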

3.3 Final Outcome of This Step

✅ We explored the dataset and checked its structure.
✅ We plotted the data, showing that petal length & width are useful for classification.
✅ We examined summary statistics and the scatter plot, which together show that Setosa is easily distinguishable, while Versicolor & Virginica overlap slightly.

What’s Next?

Congratulations! You’ve successfully prepared your dataset for training your first machine learning model in Python. But how can we train the model?

In Part 2, we’ll walk through the steps of model training and train a classifier that we’ll use to make predictions in Part 3.

Read the next part here: Building Your First Machine Learning Model in Python: Model Training