Building a Simple Neural Network with TensorFlow: A Beginner’s Guide to Handwritten Digit Classification
Setting Up the Environment
First, you’ll need to set up your development environment with the necessary libraries.
1. Install Python
Ensure you have Python installed (preferably version 3.7 or higher). You can download it from python.org.
2. Create a Virtual Environment (Optional but Recommended)
Creating a virtual environment helps manage dependencies.
# Install virtualenv if you haven't already
pip install virtualenv
# Create a virtual environment named 'tensorflow_env'
virtualenv tensorflow_env
# Activate the virtual environment
# On Windows:
tensorflow_env\Scripts\activate
# On macOS/Linux:
source tensorflow_env/bin/activate
Keras and TensorFlow are both popular tools in the world of deep learning, but they serve different purposes. Let’s break down the difference in simple terms:
1. What is TensorFlow?
- TensorFlow is a deep learning library developed by Google.
- It provides the underlying tools and infrastructure for building and training machine learning models, including neural networks.
- TensorFlow is powerful and flexible but can be complex because it operates at a lower level, giving you full control over every detail of the model and its training.
Think of TensorFlow as a toolbox with all sorts of different tools (like hammers, wrenches, and screwdrivers). You can build very advanced things, but you need to know how to use each tool.
2. What is Keras?
- Keras is a high-level API (Application Programming Interface) that runs on top of TensorFlow (or other deep learning libraries like Theano in the past).
- It’s designed to make building neural networks much easier and more intuitive. Keras hides much of the complexity of TensorFlow, allowing you to create models quickly without worrying about all the low-level details.
- Keras is more user-friendly, especially for beginners.
Think of Keras as a pre-packaged kit with instructions. It uses the tools from TensorFlow but makes it easier for you to build something without having to worry about all the technical details.
Key Differences:
| Aspect | TensorFlow | Keras |
| --- | --- | --- |
| Purpose | Deep learning framework and library | High-level API for building deep learning models |
| Ease of Use | Lower-level, more complex to use | High-level, more user-friendly |
| Control | Offers more control and flexibility | Simplifies model-building but offers less control |
| Performance | More optimized for large-scale tasks | Great for fast prototyping, built on top of TensorFlow |
| Development | Developed by Google | Developed by François Chollet, integrated into TensorFlow |
| Use Case | Advanced users needing fine-grained control | Beginners and intermediate users needing simplicity |
Relationship:
- Keras is now included as part of TensorFlow (since TensorFlow 2.0), so you don’t need to install them separately. When you use Keras in TensorFlow, you’re essentially using a simplified interface that internally uses TensorFlow for its computations.
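As a quick sanity check (a minimal sketch; the printed version string depends on your installation), you can confirm that the Keras you import is the one bundled inside TensorFlow:

import tensorflow as tf
from tensorflow import keras   # the same Keras API used throughout this tutorial

print(tf.__version__)                            # e.g. 2.x
print(keras.Sequential is tf.keras.Sequential)   # True: keras here and tf.keras are the same API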
A neural network is a computational model inspired by the way biological neural networks in the human brain function. It consists of interconnected nodes, or neurons, organized in layers. Neural networks are used to recognize patterns, classify data, make predictions, and perform complex tasks like image recognition, natural language processing, and decision-making.
Components of a Neural Network:
- Neurons (Nodes): These are the basic units of the network, similar to biological neurons. Each neuron takes input, processes it, and produces an output.
- Layers:
- Input Layer: Receives the initial data. Each neuron in this layer represents a feature in the data.
- Hidden Layers: These layers are between the input and output layers and perform computations on the data. The number of hidden layers and neurons is part of the network architecture.
- Output Layer: Produces the final result or prediction.
- Weights: Each connection between neurons has an associated weight that adjusts the strength of the signal passed between them. During training, these weights are updated to minimize error.
- Activation Function: Determines whether a neuron should be activated or not based on the input it receives. Common activation functions include sigmoid, ReLU (Rectified Linear Unit), and tanh.
How Neural Networks Work:
- Forward Propagation: Input data is passed through the network, layer by layer, until an output is produced.
- Loss Function: The output is compared to the expected result, and the difference is quantified using a loss (or cost) function.
- Backpropagation: This is the process where the error is propagated back through the network, and the weights are adjusted to reduce the error.
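To make forward propagation and the loss concrete, here is a minimal NumPy sketch of a single dense layer. The inputs, weights, and target below are made-up toy values (and mean squared error is used only for simplicity); backpropagation would then adjust the weights and biases to reduce this loss.

import numpy as np

# Toy input with 3 features and a layer with 2 neurons (made-up values)
x = np.array([0.5, -1.2, 3.0])           # input features
W = np.array([[0.1, -0.3],
              [0.8,  0.2],
              [-0.5, 0.4]])              # weights: one column per neuron
b = np.array([0.05, -0.1])               # biases

z = x @ W + b                            # weighted sum for each neuron (forward propagation)
a = np.maximum(0, z)                     # ReLU activation

target = np.array([1.0, 0.0])            # made-up expected output
loss = np.mean((a - target) ** 2)        # loss: how far the output is from the target
print(a, loss)                           # backpropagation would now update W and b to shrink this loss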
Types of Neural Networks:
- Feedforward Neural Networks (FNNs): The simplest type where data moves in one direction, from input to output.
- Convolutional Neural Networks (CNNs): Primarily used for image recognition tasks by detecting spatial hierarchies in images.
- Recurrent Neural Networks (RNNs): These have loops that allow information to persist, making them suitable for sequential data like time series or language.
- Deep Neural Networks (DNNs): These have many hidden layers and are often referred to as deep learning models. They can learn complex patterns in data.
An imaginative example to understand neural networks:
Imagine a neural network as a team of friendly robots who work together to solve a puzzle, like guessing what’s in a picture. These robots pass information to each other, and each robot has its own job to help figure things out.
Layers of Robots (Neural Network Layers):
- Input Layer (First Group of Robots): This is like the group of robots who look at the picture (like a drawing of a cat). They take all the tiny pieces of the picture and tell the next group what they see.
- Hidden Layers (Middle Robots): These robots take the information from the first group and think about it. They break down the picture into parts, like “I see something round” (maybe the cat’s head) or “I see something long” (maybe the tail). They pass their thoughts to the next robots.
- Output Layer (Final Robot): This last robot listens to all the middle robots and makes a guess. “Hmm, based on what everyone is saying, I think it’s a cat!” It could also guess other things, like a dog or a bird, but it picks what it thinks is most likely.
Each robot (or layer) talks to the next one and helps figure out the answer. Together, they solve the puzzle!
Why Layers Are Important:
The first robots see the simple things (like shapes), and the next ones understand more complicated stuff (like “Oh, this is an animal”). Finally, the last robot makes the big decision.
In simple terms: The robots (layers) work as a team, each doing a small part to guess what’s in the picture!
3. Install TensorFlow and Other Dependencies
Install TensorFlow and the other necessary libraries using pip.
pip install tensorflow matplotlib
- TensorFlow: The core library for building and training neural networks.
- Matplotlib: For visualizing data and results.
Understanding the MNIST Dataset
The MNIST dataset is a collection of 70,000 handwritten digits (0-9) split into training and testing sets. Each image is 28×28 pixels in grayscale.
- Training Set: 60,000 images
- Testing Set: 10,000 images
We’ll use this dataset to train our neural network to recognize handwritten digits.
Building the Neural Network
We’ll use TensorFlow’s high-level Keras API to build our model. Here’s a step-by-step guide.
1. Import Libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
2. Load and Explore the Dataset
# Load the MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Normalize the images to [0, 1] range
train_images = train_images / 255.0
test_images = test_images / 255.0
# Explore the data
print(f"Training images shape: {train_images.shape}")
print(f"Training labels shape: {train_labels.shape}")
Explanation:
- The pixel values in the MNIST dataset range from 0 to 255 because they are grayscale images. Here, 0 represents black, and 255 represents white, with values in between representing varying shades of gray.
- By dividing the pixel values by 255.0, you scale them to the range [0, 1]. This process is called normalization.
Why Normalize?
- Improved Model Performance: Neural networks often perform better when input data is normalized to a small range, typically between 0 and 1 or -1 and 1. This helps the model converge faster during training and can lead to better results.
- Numerical Stability: Large input values can cause numerical instability during training (e.g., very large gradients). Normalizing inputs prevents this issue and ensures stable training.
Output:
Training images shape: (60000, 28, 28)
Training labels shape: (60000,)
This output provides information about the dimensions (or shape) of the training dataset.
Training Images Shape: (60000, 28, 28)
- 60000: This indicates that there are 60,000 training images in the dataset.
- 28, 28: Each image is a 28×28 pixel grayscale image. This means each image has a height of 28 pixels and a width of 28 pixels.
So, the shape (60000, 28, 28) means you have a dataset consisting of 60,000 images, where each image is a 2D array of 28×28 pixel values.
Training Labels Shape: (60000,)
- 60000: This indicates that there are 60,000 labels, one for each of the 60,000 images.
- The shape (60000,) means it's a 1-dimensional array containing 60,000 label values, where each label is an integer from 0 to 9 representing the digit depicted in the corresponding image.
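If you want to poke at the arrays yourself (a small optional sketch, assuming the dataset has been loaded and normalized as above), you can inspect a few labels and the pixel value range directly:

# The labels are plain integers from 0 to 9
print(train_labels[:10])                        # e.g. [5 0 4 1 9 2 1 3 1 4]
# After dividing by 255.0, pixel values lie in the [0, 1] range
print(train_images.min(), train_images.max())   # 0.0 1.0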
3. Visualize the Data (Optional)
Visualizing some samples can help understand the dataset.
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(train_labels[i])
plt.show()
Explanation:
plt.subplot(nrows, ncols, index): This function creates a grid of subplots (small individual plots) within a figure.
Parameters:
- nrows=5: Specifies that the grid should have 5 rows.
- ncols=5: Specifies that the grid should have 5 columns.
- index=i+1: Specifies the position of the current subplot in the grid. i is the loop variable (0 to 24 in a loop of 25 images), so i+1 ensures that the position is 1-based (because subplot indexing starts at 1, not 0).
4. Define the Model Architecture
We’ll create a simple Sequential model with:
- Flatten Layer: Converts 2D images into 1D vectors.
- Dense Layers: Fully connected layers with activation functions.
model = keras.Sequential([
layers.Flatten(input_shape=(28, 28)), # Input layer
layers.Dense(128, activation='relu'), # Hidden layer
layers.Dense(10, activation='softmax') # Output layer
])
This code defines a neural network model using TensorFlow's Keras API. The model consists of three layers: an input (flatten) layer, one hidden layer, and an output layer. Let's break down each part.
In simple words, ReLU (Rectified Linear Unit) is an activation function used in neural networks that works as follows:
- If the input value is positive, ReLU keeps it the same.
- If the input value is negative, ReLU changes it to zero.
Mathematically, it's expressed as f(x) = max(0, x).
Example:
- If the input is 5, ReLU outputs 5 (since it’s positive).
- If the input is -3, ReLU outputs 0 (since it’s negative).
Why use ReLU?
- Simplicity: ReLU is easy to compute.
- Non-linearity: It helps the neural network learn more complex patterns by introducing non-linearity.
- Efficiency: ReLU speeds up training and often leads to better performance compared to older activation functions like sigmoid or tanh.
In essence, ReLU helps the neural network focus on important positive signals and ignore negative signals, making the training process more effective.
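To see this behavior directly, here is a tiny NumPy sketch of ReLU applied to a few made-up values:

import numpy as np

def relu(x):
    # Keep positive values, replace negatives with zero: f(x) = max(0, x)
    return np.maximum(0, x)

print(relu(np.array([5.0, -3.0, 0.0, 2.5])))   # [5.  0.  0.  2.5]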
4.1. keras.Sequential([]):
- The Sequential model in Keras allows you to build a neural network by stacking layers in a linear order. Each layer has one input and one output, making the model simple and intuitive.
4.2. layers.Flatten(input_shape=(28, 28)):
- Flatten Layer: This layer converts a 2D input (like a 28×28 image) into a 1D vector.
- Input shape is (28, 28) because each image is 28×28 pixels.
- After flattening, each 28×28 image (which is 2D) is transformed into a 1D vector of size 784 (28*28 = 784). This is necessary because fully connected layers (Dense layers) expect 1D vectors as input.
- Example: A single 28×28 image is transformed into a 1D array of 784 elements.
4.3. layers.Dense(128, activation='relu'):
- Dense Layer (Fully Connected Layer): This layer is a standard fully connected neural network layer.
- 128 neurons (units): This layer contains 128 neurons. Each neuron takes the 784 input values from the previous layer (after flattening) and learns to extract features from them.
- Activation Function: relu (Rectified Linear Unit) is used as the activation function. ReLU outputs the input directly if it's positive; otherwise, it outputs zero. It introduces non-linearity to the model and helps it learn complex patterns.
4.4. layers.Dense(10, activation='softmax'):
- Dense Layer (Output Layer): This is the output layer of the model.
- 10 neurons (units): There are 10 neurons, corresponding to the 10 possible digit classes (0-9) in the MNIST dataset. Each neuron represents the probability of the input image belonging to one of these 10 classes.
- Activation Function: softmax is used as the activation function. Softmax converts the raw output scores of the neurons into probabilities, which sum to 1. The class with the highest probability is the predicted label.
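For intuition, here is a small NumPy sketch of softmax applied to made-up raw scores, showing that the outputs sum to 1:

import numpy as np

def softmax(scores):
    # Subtracting the max first is a standard trick for numerical stability
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))   # made-up raw scores for 3 classes
print(probs, probs.sum())                    # probabilities that sum to 1.0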
Model Flow:
- Input: The model takes an input of size (28, 28) (a single grayscale image).
- Flattening: The image is flattened into a 1D array of 784 elements.
- Hidden Layer: The Dense layer with 128 neurons learns to extract features from the 1D input using the ReLU activation function.
- Output Layer: The final Dense layer with 10 neurons uses the softmax activation function to predict probabilities for each of the 10 possible digit classes (0 to 9).
Explanation:
- Flatten Layer: Transforms each 28×28 image into a 784-element vector.
- Dense Layer (128 units): Learns 128 features from the input.
- Dense Layer (10 units): Outputs probabilities for each of the 10 classes (digits 0-9).
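Before compiling, you can optionally call model.summary() to check the layer output shapes and parameter counts. The counts noted in the comments are what this architecture works out to (784*128 + 128 and 128*10 + 10):

model.summary()
# Flatten: output shape (None, 784), 0 parameters
# Dense (128 units): 784*128 weights + 128 biases = 100,480 parameters
# Dense (10 units):  128*10 weights + 10 biases  = 1,290 parameters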
5. Compile the Model
Specify the optimizer, loss function, and metrics.
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
Explanation:
In simple words, the Adam optimizer is an algorithm that helps a neural network learn by adjusting the model’s weights during training. It improves the learning process by using two key techniques:
- Momentum: It remembers the direction it’s been moving in so it can continue in that direction without getting stuck or bouncing around too much.
- Adaptive Learning Rate: It adjusts how fast or slow the network learns for each parameter, based on how much error is happening. This means it can make bigger adjustments where needed and smaller adjustments where things are already working well.
These features make Adam efficient, fast, and good at handling complex tasks like training deep neural networks. It’s popular because it often works well with little need for tuning.
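If you later want to tune the optimizer rather than use the string shortcut, you can pass an explicit Adam instance. This is a sketch of an equivalent call; 0.001 is Keras's default learning rate and is shown only for illustration:

# Equivalent to optimizer='adam', but with the learning rate spelled out
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])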
Sparse Categorical Crossentropy is a loss function used when you have multiple classes to predict (like digits 0 to 9) and your labels are represented as numbers (like 0, 1, 2, etc.).
It helps the neural network learn by comparing the model’s predictions (probabilities for each class) with the actual class (given as a number). If the model predicts the wrong class with high confidence, it gets penalized more, so it learns to make better predictions over time.
It’s called “sparse” because you don’t need to convert the labels into a more complicated format (like one-hot encoding). You can use the simple class numbers directly.
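As a small illustration (a sketch using the built-in loss class, with made-up predicted probabilities), you can see how a confident wrong prediction is penalized far more than a confident right one:

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# True label is class 3; each prediction is a probability distribution over 10 classes
y_true = [3]
confident_right = [[0.01, 0.01, 0.01, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02]]
confident_wrong = [[0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02]]

print(loss_fn(y_true, confident_right).numpy())  # small loss (about 0.1)
print(loss_fn(y_true, confident_wrong).numpy())  # large loss (about 4.6)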
Accuracy is a metric used to measure how well a model is performing by checking the percentage of correct predictions. It helps you monitor how well your model is learning during training and how well it is performing when evaluated on new data.
What is Accuracy?
- Accuracy is the ratio of correct predictions to the total number of predictions.
- If the model makes a correct prediction, it counts as a “hit”; if it makes an incorrect prediction, it counts as a “miss.”
In Simple Words:
- If your model predicts 8 out of 10 images correctly, its accuracy is 80%.
- Accuracy is a quick way to see how well your model is doing, both while it’s training and when you evaluate it on test data.
Training the Model
Now, let’s train our model on the training data.
# Train the model
history = model.fit(train_images, train_labels, epochs=10,
validation_split=0.1)
Parameters:
- train_images & train_labels: The training data and corresponding labels.
- epochs=10: Number of times the model will iterate over the entire training dataset.
- validation_split=0.1: 10% of the training data will be used for validation.
Output:
Epoch 1/10
54000/54000 [==============================] - 4s 75us/sample - loss: 0.3000 - accuracy: 0.9125 - val_loss: 0.1691 - val_accuracy: 0.9470
...
Epoch 10/10
54000/54000 [==============================] - 3s 68us/sample - loss: 0.0487 - accuracy: 0.9841 - val_loss: 0.0947 - val_accuracy: 0.9728
Breakdown of the Output:
Epoch 1/10 and Epoch 10/10:
- Epoch refers to one complete pass through the entire training dataset. In your case, the model is being trained for 10 epochs, meaning it will go through the training data 10 times.
- The output shows details for each epoch, starting with Epoch 1/10 (the first pass through the data) and ending with Epoch 10/10 (the last pass).
54000/54000 [==============================]:
- This shows that 54,000 samples were processed in this epoch and confirms that the full pass over the data completed. (The training set has 60,000 images, but 10% is held out for validation, leaving 54,000 for training.)
- [==============================] is a progress bar showing how much of the epoch has been completed.
Time (e.g., 4s 75us/sample):
- The time it takes to complete the epoch is displayed. For example, in Epoch 1/10, it took 4 seconds to process the entire dataset, with an average processing time of 75 microseconds per sample.
Loss and Accuracy:
- Loss: This is the error the model is making during training. It’s the value the model tries to minimize as it learns. A lower loss means better performance. For example, in Epoch 1/10, the training loss is 0.3000.
- Accuracy: This is how often the model is making correct predictions during training. For example, after the first epoch, the model had an accuracy of 91.25%.
val_loss and val_accuracy:
- val_loss: This is the loss on the validation set (data not used during training, but used to check the model’s performance). A lower validation loss means the model generalizes better to new data. After the first epoch, the validation loss is 0.1691.
- val_accuracy: This is the accuracy on the validation set. After Epoch 1, the validation accuracy is 94.70%.
Overall Explanation:
- At Epoch 1/10, the model’s training accuracy is 91.25%, and validation accuracy is 94.70%, meaning the model is doing well but has room to improve.
- By Epoch 10/10, the training accuracy improves to 98.41%, and the validation accuracy reaches 97.28%, showing that the model has learned well and is making better predictions on both the training and validation sets.
Each epoch helps the model improve by adjusting its internal parameters based on the data, leading to better performance over time.
Visualize Training Progress (Optional)
Plot training and validation accuracy and loss over epochs.
plt.figure(figsize=(12, 4))
# Plot accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
# Plot loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
The resulting figure shows two plots that track the accuracy and loss of your neural network model over time (epochs). Here's how to interpret them:
Left Plot: Accuracy Over Epochs
- X-axis (Epoch): This shows the number of training cycles the model has gone through (epochs). It goes from 0 to 9, meaning your model has been trained for 10 epochs.
- Y-axis (Accuracy): This shows the accuracy, or how often the model correctly predicts the right class.
- Blue Line (Training Accuracy): This line shows how well the model is performing on the training data. You can see that the accuracy increases steadily over time, reaching close to 99% by the final epoch.
- Orange Line (Validation Accuracy): This line shows how well the model is performing on unseen validation data. The accuracy increases quickly in the first few epochs and then stabilizes around 97% to 98%.
Key Insights from the Accuracy Plot:
- The training accuracy continues to improve as the model trains over more epochs.
- The validation accuracy plateaus after around epoch 3, suggesting that further training might not be improving the model’s performance on new data much.
- Since the training accuracy keeps improving while validation accuracy plateaus, it could be a sign that the model is starting to overfit (learn too much from the training data and not generalizing well to new data).
Right Plot: Loss Over Epochs
- X-axis (Epoch): Again, this shows the number of epochs (from 0 to 9).
- Y-axis (Loss): This represents how far the model’s predictions are from the true labels, where a lower loss means the model is performing better.
- Blue Line (Training Loss): The training loss decreases steadily as the model learns, meaning the model is making fewer mistakes on the training data over time.
- Orange Line (Validation Loss): The validation loss decreases at first but then flattens and slightly increases after epoch 3. This could indicate that while the model is still improving on the training set, it’s starting to make slightly more errors on the validation set, again suggesting possible overfitting.
Key Insights from the Loss Plot:
- The training loss is decreasing consistently, which is what you expect during training.
- The validation loss decreases at first but then flattens or slightly increases, which is another indication of overfitting after a certain point.
Overall Conclusion:
- Your model is learning well, as shown by the increasing training accuracy and decreasing training loss.
- However, the validation metrics show signs that the model stops improving after about 3 epochs, and further training might cause overfitting. At this point, you may want to stop training early or try regularization techniques to prevent overfitting.
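If you want to try that, here is a hedged sketch of two common options, building on the imports and data from earlier: an EarlyStopping callback that halts training once validation loss stops improving, and a Dropout layer added to the architecture. The patience value and dropout rate are arbitrary starting points, not tuned values.

# Option 1: stop training when validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2,
                                           restore_best_weights=True)
# history = model.fit(train_images, train_labels, epochs=10,
#                     validation_split=0.1, callbacks=[early_stop])

# Option 2: add dropout between the Dense layers to reduce overfitting
model_with_dropout = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),                 # randomly drop 20% of activations during training
    layers.Dense(10, activation='softmax')
])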
Evaluating the Model
After training, evaluate the model’s performance on the test dataset.
test_loss, test_accuracy = model.evaluate(test_images, test_labels, verbose=2)
print(f"\nTest accuracy: {test_accuracy}")
Breakdown of the Code:
model.evaluate(test_images, test_labels, verbose=2):
- test_images and test_labels: These are the images and labels from the test dataset, which the model has not used during training.
- evaluate(): This function computes the loss and accuracy of the model on the test data.
- Loss: Measures how far off the model's predictions are from the actual labels in the test set.
- Accuracy: Tells you the percentage of correct predictions the model made on the test data.
- verbose=2: This controls how much information is displayed during evaluation. With verbose=2, Keras prints a single summary line instead of a per-batch progress bar.
print(f"\nTest accuracy: {test_accuracy}"):
- This prints the test accuracy to give you an idea of how well the model performed on the unseen test data.
Output:
313/313 - 0s - loss: 0.0925 - accuracy: 0.9732
Test accuracy: 0.9732
This output shows the results of evaluating your model on the test data. Here’s how to interpret it:
- 313/313: This indicates that the test set was processed in 313 batches; with the default batch size of 32, the 10,000 test images are divided into 313 smaller groups for evaluation.
- 0s: The time it took to process the entire test set was very short, less than 1 second.
- Loss: 0.0925: This is the test loss, which indicates how far off the model’s predictions were from the true labels on the test data. A loss of 0.0925 means the model performed quite well, as lower loss values are better.
- Accuracy: 0.9732: This is the test accuracy, meaning the model correctly predicted the test images’ labels 97.32% of the time. This is a high accuracy, indicating that your model is performing very well on new, unseen data.
Final Conclusion:
Your model is making correct predictions on about 97.32% of the test data, which is a great performance for this task. The low test loss and high test accuracy suggest that your model has generalized well to new data and is not overfitting.
Making Predictions
Let’s use the trained model to make predictions on new data.
1. Predict on Test Images
# Make predictions
predictions = model.predict(test_images)
# predictions is a 2D array where each row corresponds to the probability of each class
print(predictions[0]) # Probabilities for the first test image
Breakdown of the Code:
predictions = model.predict(test_images):
- This line generates predictions for all the test images.
- The predict() function returns a 2D array, where:
- Each row corresponds to a test image.
- Each element in the row is a predicted probability for a class (in your case, digits 0 to 9).
- For example, for the first test image, it might predict probabilities like [0.05, 0.10, 0.02, 0.80, 0.01, 0.01, 0.01, 0.00, 0.00, 0.00], meaning the model predicts with 80% confidence that the image belongs to class 3.
print(predictions[0]):
- This prints the predicted probabilities for the first test image: an array of 10 values like the example above.
2. Interpreting Predictions
Find the class with the highest probability.
import numpy as np
predicted_label = np.argmax(predictions[0])
print(f"Predicted label: {predicted_label}")
print(f"True label: {test_labels[0]}")
Visualizing a Prediction
plt.figure()
plt.imshow(test_images[0], cmap=plt.cm.binary)
plt.title(f"Predicted: {predicted_label}, True: {test_labels[0]}")
plt.axis('off')
plt.show()
Conclusion and Next Steps
Congratulations! You’ve successfully built, trained, and evaluated a simple neural network using TensorFlow to classify handwritten digits. Here’s a summary of what you’ve learned:
- Data Handling: Loading and preprocessing data.
- Model Building: Creating a neural network architecture.
- Training: Fitting the model to the data.
- Evaluation: Assessing model performance.
- Prediction: Making predictions on new data.
Possible Extensions
To further enhance your understanding and improve the model, consider the following:
- Add More Layers: Experiment with deeper architectures.
- Use Convolutional Neural Networks (CNNs): They are more effective for image data (see the sketch after this list).
- Regularization Techniques: Apply dropout or L2 regularization to prevent overfitting.
- Data Augmentation: Enhance the dataset with transformed images.
- Hyperparameter Tuning: Adjust learning rates, batch sizes, etc., for better performance.
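As a starting point for the CNN extension, here is a hedged sketch of a small convolutional model for MNIST. Note that Conv2D expects a channel dimension, so the images would need to be reshaped to (28, 28, 1) before training, and the layer sizes here are arbitrary choices rather than tuned values.

# Images must gain a channel dimension for Conv2D, e.g.:
# train_images_cnn = train_images.reshape(-1, 28, 28, 1)   # shape (60000, 28, 28, 1)

cnn_model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
cnn_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])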
Complete Code
# Import the necessary libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
# Load and preprocess the MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Normalize the images to values between 0 and 1
train_images = train_images / 255.0
test_images = test_images / 255.0
# Explore the shape of the data
print(f"Training images shape: {train_images.shape}")
print(f"Training labels shape: {train_labels.shape}")
# Visualize some training images (optional)
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(train_labels[i])
plt.show()
# Build the neural network model
model = keras.Sequential([
layers.Flatten(input_shape=(28, 28)), # Input layer to flatten the 2D image to 1D
layers.Dense(128, activation='relu'), # Hidden layer with 128 neurons and ReLU activation
layers.Dense(10, activation='softmax') # Output layer with 10 neurons (one for each class), using softmax
])
# Compile the model
model.compile(optimizer='adam', # Optimizer: Adam
loss='sparse_categorical_crossentropy', # Loss function: sparse categorical crossentropy
metrics=['accuracy']) # Metric: accuracy
# Train the model
history = model.fit(train_images, train_labels, epochs=10,
validation_split=0.1) # 10% of training data for validation
# Visualize the training progress (accuracy and loss over epochs)
plt.figure(figsize=(12, 4))
# Plot training and validation accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
# Plot training and validation loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(test_images, test_labels, verbose=2)
print(f"\nTest accuracy: {test_accuracy}")
# Make predictions on the test images
predictions = model.predict(test_images)
# Display the first prediction and compare with true label
predicted_label = np.argmax(predictions[0])
print(f"Predicted label: {predicted_label}")
print(f"True label: {test_labels[0]}")
# Visualize the first test image and its predicted label
plt.figure()
plt.imshow(test_images[0], cmap=plt.cm.binary)
plt.title(f"Predicted: {predicted_label}, True: {test_labels[0]}")
plt.axis('off')
plt.show()