Introduction
This is the continuation of our step-by-step guide on building your first machine learning model in Python. If you haven’t read Part 1: Data Preparation yet, I recommend checking it out first.
In this part, we’ll split the data into training and test sets, then choose and train a model.
We’ll cover:
- Why Do We Split the Data?
- Understand the Data After Splitting
- Choosing a Model: Why a Decision Tree?
- Training the Decision Tree Model
By the end of this part, you’ll know how to train a model for making predictions later.
4. Step 4: Preparing the Data (Train/Test Split)
Now that we have explored the dataset, it’s time to prepare it for training. Machine learning models need to be tested on unseen data to evaluate how well they generalize. To simulate this, we split the dataset into two parts:
- Training Set (80%): Used to train the model.
- Test Set (20%): Used to evaluate how well the model performs on new data.
4.1 Why Do We Split the Data?
🔹 Exposes overfitting – If we train and evaluate on 100% of the data, a model that simply memorized it would still look perfect. Holding out a test set shows whether the model learned real patterns.
🔹 Ensures fair evaluation – The test set gives us a realistic measure of model accuracy on unseen data.
🔹 Maintains class balance – Using stratified sampling, we ensure that each species appears in the same proportion in the train and test sets.
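To make the last point concrete, here is a minimal sketch that compares the test-set class counts with and without stratification. It assumes the dataset comes from scikit-learn’s load_iris, and random_state=0 is an arbitrary value chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Plain random split: class counts in the test set are left to chance
# (random_state=0 is an arbitrary choice for this illustration)
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the class proportions of y are preserved in both parts
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Count how many samples of each class (0, 1, 2) ended up in each test set
plain_counts = np.bincount(y_test_plain, minlength=3).tolist()
strat_counts = np.bincount(y_test_strat, minlength=3).tolist()

print("Test-set counts without stratify:", plain_counts)
print("Test-set counts with stratify:   ", strat_counts)
```

The stratified counts should come out as 10 samples per species, while the plain split may drift away from that balance depending on the random seed.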
4.2 Splitting the Data Using train_test_split
We use scikit-learn’s train_test_split function to randomly split our dataset. Let’s apply an 80-20 split, meaning 80% of the data is for training and 20% for testing.
from sklearn.model_selection import train_test_split

# `iris` is the dataset object loaded in Part 1
# (assumed here to come from sklearn.datasets.load_iris())

# Define features (X) and labels (y)
X = iris.data    # Feature matrix
y = iris.target  # Labels (species)

# Split data: 80% train, 20% test, stratify=y to maintain class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Print dataset sizes
print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])
Key Outputs:
- 120 samples in the training set.
- 30 samples in the test set.
- stratify=y ensures each species keeps the same proportion in both sets as in the full dataset.
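The random_state argument is what makes this split repeatable. As a small sketch (again assuming the data comes from load_iris), calling train_test_split twice with the same settings should return exactly the same arrays:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Two independent calls with identical arguments, including random_state
split_a = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits with the same random_state:", same_split)
```

Dropping random_state (or changing its value) would typically assign different samples to each set, which is why results can differ slightly between runs.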
4.3 Understanding the Data After Splitting
Now, let’s check how many samples of each class exist in the training and test sets.
import numpy as np

# Count occurrences of each class in train and test sets
unique_train, counts_train = np.unique(y_train, return_counts=True)
unique_test, counts_test = np.unique(y_test, return_counts=True)

# Display class distribution
# (.tolist() converts NumPy integers to plain Python ints so the dict prints cleanly)
print("Training set class distribution:", dict(zip(unique_train.tolist(), counts_train.tolist())))
print("Test set class distribution:", dict(zip(unique_test.tolist(), counts_test.tolist())))
Expected Output:
Training set class distribution: {0: 40, 1: 40, 2: 40}
Test set class distribution: {0: 10, 1: 10, 2: 10}
Where:
- 0 → Setosa
- 1 → Versicolor
- 2 → Virginica
Since we used stratified sampling, each species is equally distributed in both training and test sets, meaning our model will learn fairly from all three species.
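If the numeric labels are hard to read, the counts can be reported by species name instead. The sketch below is one possible way to do that; it assumes the dataset object comes from load_iris (which exposes target_names), and named_counts is a hypothetical helper introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

def named_counts(labels):
    """Return {species name: sample count} for an array of class indices."""
    classes, counts = np.unique(labels, return_counts=True)
    return {str(iris.target_names[c]): int(n) for c, n in zip(classes, counts)}

train_counts = named_counts(y_train)
test_counts = named_counts(y_test)

print("Training set:", train_counts)
print("Test set:", test_counts)
```

This should report 40 samples per species in the training set and 10 per species in the test set, matching the numeric output above.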
4.4 Final Outcome of This Step
✅ We split the data into training and test sets (80-20).
✅ We ensured class balance so that all species are fairly represented.
✅ We confirmed dataset sizes, ensuring a proper setup for training.
Now that our data is ready, let’s train our first machine learning model!
5. Step 5: Choosing and Training a Model
Now that we have prepared our dataset, it’s time to train our first machine learning model!
For this task, we need a classification algorithm that can learn from our training data and make predictions on new iris flowers.
5.1 Choosing a Model: Why a Decision Tree?
There are many machine learning models we could use, but for this example, we’ll use a Decision Tree Classifier.
- Simple to understand – It makes decisions using if-else rules, like a flowchart.
- Interpretable – We can visualize the decision-making process.
- Works well on small datasets – The Iris dataset is relatively small, so a decision tree is a great fit.
How Does a Decision Tree Work?
Imagine a series of questions like:
👉 Is petal length < 2.5 cm? → If yes, it’s probably Setosa.
👉 If no, is petal width < 1.8 cm? → If yes, it’s likely Versicolor, otherwise Virginica.
The tree keeps asking yes/no questions until it reaches a decision.
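Written as code, those two questions are just nested if-else checks. The function below is a hand-written illustration of the idea using the thresholds from the questions above; it is not the tree scikit-learn will learn, and the sample measurements are made up for the example.

```python
def classify_by_hand(petal_length_cm, petal_width_cm):
    """Toy flowchart using the two example questions (not a trained model)."""
    if petal_length_cm < 2.5:
        return "setosa"
    if petal_width_cm < 1.8:
        return "versicolor"
    return "virginica"

# Hypothetical measurements: (petal length, petal width) in cm
examples = [(1.4, 0.2), (4.5, 1.5), (5.8, 2.2)]
predictions = [classify_by_hand(length, width) for length, width in examples]

for (length, width), species in zip(examples, predictions):
    print(f"petal length={length}, petal width={width} -> {species}")
```

A trained decision tree does the same kind of thing, except that it picks the questions and thresholds automatically from the training data.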
5.2 Training the Decision Tree Model
We’ll use scikit-learn’s DecisionTreeClassifier
to train our model on the training dataset (X_train, y_train).
from sklearn.tree import DecisionTreeClassifier

# Initialize the model
model = DecisionTreeClassifier(random_state=42)

# Train (fit) the model on the training data
model.fit(X_train, y_train)

# Confirm that training finished
print("Model trained successfully!")
What Happened Here?
- We created a DecisionTreeClassifier.
- We trained it on the training data (120 samples).
- The model has now learned patterns from the training data.
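One way to see that the model really did learn something is to look at the attributes scikit-learn sets during fit. This is a minimal sketch that rebuilds the same split and model so it can run on its own; it assumes the data comes from load_iris.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Attributes ending in "_" are filled in by fit()
print("Classes seen during training:", model.classes_.tolist())
print("Number of input features:", model.n_features_in_)

# Relative contribution of each feature to the tree's splits (sums to 1)
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The exact importance values depend on the split and the library version, so treat them as a rough indication of which measurements the tree relied on.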
5.3 Understanding the Decision Tree Structure
A decision tree splits data step by step. Let’s check the depth of our tree and how many final decision points (leaves) it has.
print("Tree depth:", model.get_depth())
print("Number of leaves:", model.get_n_leaves())
Expected Output Example:
Tree depth: 5
Number of leaves: 8
What Does This Mean?
- The tree depth tells us how many levels of decisions were needed.
- More depth = More complex tree (could lead to overfitting).
- The number of leaves represents the final decision points (where the model classifies the species).
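Depth and leaf counts are summaries; the learned questions themselves can also be printed as text. The sketch below uses scikit-learn’s export_text for that, rebuilding the split and model so it is self-contained (the data is assumed to come from load_iris).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Render the fitted tree as indented if-else style rules
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each indented level in the printout is one more level of depth, and each line ending in a class is one leaf.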
Note: The values may vary depending on:
- Random factors (with random_state=42 the result is reproducible on a given setup, but it may still differ between scikit-learn versions).
- Data splitting (how train_test_split assigned samples).
- Hyperparameters (default settings allow unlimited depth unless restricted).
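To illustrate the hyperparameter point, here is a small sketch that trains a second tree with max_depth=3 next to the default one. The value 3 is an arbitrary choice for demonstration, not a recommendation, and the data is assumed to come from load_iris.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Default settings: the tree may grow until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Restricted tree: at most 3 levels of questions (arbitrary value for illustration)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print("Default tree -> depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("max_depth=3  -> depth:", small_tree.get_depth(), "leaves:", small_tree.get_n_leaves())
```

A shallower tree asks fewer questions, which usually makes it easier to read and less likely to memorize the training data, at the cost of possibly missing finer patterns.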
5.4 Final Outcome of This Step
✅ We chose a Decision Tree Classifier for our model.
✅ We trained the model on the Iris training set (120 samples).
✅ We checked the complexity of our trained model.
Our model is now trained and ready to make predictions!
What’s Next?
Congratulations! You’ve successfully trained your first machine learning model in Python. But how well does it perform?
In Part 3, we’ll evaluate the model’s accuracy, understand key performance metrics, and use it to make predictions.
Read the next part here: Building Your First Machine Learning Model in Python: Model Evaluation & Prediction