Neural Networks

Feedforward Neural Networks

Feedforward neural networks (FNNs) are a class of artificial neural networks where the flow of information moves strictly in one direction—from the input layer, through any hidden layers, to the output layer. There are no feedback loops or cycles, meaning that once the data passes through a layer, it does not return to any of the previous layers. This architecture makes FNNs simple yet powerful tools for a wide range of computational tasks.

Unlike recurrent neural networks (RNNs), which allow for feedback connections and can maintain internal states, FNNs have a straightforward structure that processes input data in a single forward pass, layer by layer. The key characteristic of FNNs is that they are memoryless: each prediction is based solely on the current input, without considering any past inputs or outputs.

FNNs can handle both classification and regression tasks, depending on the nature of the output. In classification tasks, they produce categorical outputs, while in regression tasks, they predict continuous values. These networks are particularly useful when there is a clear input-to-output mapping without the need to account for sequential dependencies, making them ideal for static data analysis.

Network structure

Activation functions

Sigmoid

Tanh (Hyperbolic Tangent)

ReLU (Rectified Linear Unit)

Leaky ReLU

Softmax

Role of Activation Functions

Training

Backpropagation and Gradient Descent

Genetic Algorithm

C++ implementation

Base class

SGD derived class

GA derived class

Complete code (GitHub)

Network structure

The structure is organized into layers, and information flows sequentially through these layers:

  • Input Layer: The network begins with an input layer, which receives the raw data. Each neuron in the input layer represents a feature from the input dataset.

  • Hidden Layers: After the input layer, the data is passed through one or more hidden layers. These layers transform the data through a series of weighted connections. Each neuron in a hidden layer receives input from all neurons in the previous layer, processes this information, and passes it to the next layer.

  • Output Layer: The final layer is the output layer, where the network produces the result. This could be a prediction, classification, or regression value depending on the task at hand.

The flow of information in a feedforward neural network is strictly unidirectional, moving from the input layer to the output layer without any backward connections. First, the input data is passed into the network through the input layer, where it is received by the first hidden layer. Each hidden layer then processes the data by applying a set of transformations, including the application of weights and biases, and sends the transformed data to the next layer. This process continues layer by layer until the output layer is reached, where the final result of the network is produced.

At each stage, the input is progressively transformed, but once the data has been processed by a layer, it does not loop back or revisit any previous layers. This lack of feedback is what fundamentally distinguishes feedforward neural networks from other types of neural architectures, such as recurrent neural networks.
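
Before introducing the concrete C++ library described later, this layer-by-layer flow can be summarized in a short, self-contained sketch; plain nested std::vector stands in here for the matrix types used in the actual implementation:


#include <cstddef>
#include <utility>
#include <vector>

// Minimal sketch of the forward pass: each layer computes a = f(W a + b).
// W[l][i][j] is the weight from neuron j of layer l to neuron i of layer l+1.
std::vector<double> feedforward(const std::vector<std::vector<std::vector<double>>>& W,
                                const std::vector<std::vector<double>>& b,
                                std::vector<double> a, // input activations
                                double (*f)(double))   // activation function
{
    for (std::size_t l = 0; l < W.size(); ++l)
    {
        std::vector<double> z(b[l]); // start from the bias vector
        for (std::size_t i = 0; i < z.size(); ++i)
            for (std::size_t j = 0; j < a.size(); ++j)
                z[i] += W[l][i][j] * a[j]; // weighted sum over the previous layer
        for (std::size_t i = 0; i < z.size(); ++i)
            z[i] = f(z[i]); // non-linear activation
        a = std::move(z);   // this layer's output feeds the next layer
    }
    return a; // activations of the output layer
}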

Activation functions

Activation functions play a crucial role in introducing non-linearity into the network. Without activation functions, the network would only be capable of learning and representing linear relationships, which would significantly limit its ability to model complex, real-world data.

Each neuron in the hidden and output layers of an FNN applies an activation function to its input, which allows the network to approximate non-linear mappings between inputs and outputs.

Some of the common activation functions are described below.

Sigmoid

The sigmoid function is defined as:

f(x) = \frac{1}{1 + e^{-x}}

This function maps any input to a value between 0 and 1, making it particularly useful for binary classification tasks where outputs need to be constrained within this range. However, despite its utility in simple networks, the sigmoid function is prone to issues like vanishing gradients, which can slow down or even halt the training process in deep networks, rendering it less ideal for complex architectures.

Tanh (Hyperbolic Tangent)

The hyperbolic tangent, or tanh, function is given by:

f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Unlike the sigmoid function, the tanh function maps input values to a range between -1 and 1. This wider, zero-centered output range makes it useful in cases where negative outputs are meaningful. It shares some properties with the sigmoid function, but its gradients are stronger around zero (the derivative peaks at 1, versus 0.25 for the sigmoid), which can speed up training in certain situations.

ReLU (Rectified Linear Unit)

The ReLU function, defined as:

f(x) = \max(0, x)

is both simple and computationally efficient. It sets all negative inputs to zero while leaving positive values unchanged. ReLU’s efficiency and its ability to mitigate the vanishing gradient problem, where gradients become too small to effectively update weights, make it one of the most popular activation functions in deep neural networks.

Leaky ReLU

Leaky ReLU is a variant of ReLU, described by the equation:

f(x) = \max(0.01x, x)

It addresses one of the limitations of standard ReLU, known as the “dying ReLU” problem, where neurons can sometimes become inactive and stop learning. By allowing a small, non-zero gradient for negative inputs, Leaky ReLU ensures that neurons can still update even when the input is negative, improving network performance in some cases.
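
The library code shown later implements only the sigmoid and tanh functions; for reference, ReLU and Leaky ReLU can be written in the same scalar/array style (a sketch, not part of the library):


#include <algorithm>
#include <cstddef>

// Sketch of ReLU and Leaky ReLU (and their derivatives) in the same
// scalar/array style as the sigmoid and tanh code shown later.
template <typename T> inline T relu(T x) { return std::max(T(0), x); }

template <typename T> inline T drelu(T x) { return x > T(0) ? T(1) : T(0); }

template <typename T> inline T leaky_relu(T x) { return x > T(0) ? x : T(0.01) * x; }

template <typename T> inline T dleaky_relu(T x) { return x > T(0) ? T(1) : T(0.01); }

// vectorized form, applied in place to an array
template <typename T> inline void relu(T* arr, const std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) arr[i] = relu(arr[i]);
}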

Softmax

The softmax function is typically used in the output layer of classification networks and is expressed as:

f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

It transforms the network’s output into a probability distribution over multiple classes, with the sum of the outputs equaling 1. This makes softmax particularly suited for multi-class classification problems, where the network needs to predict probabilities for each class.
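
A direct implementation of this formula can overflow for large inputs; subtracting the maximum logit before exponentiating is the usual fix and leaves the result unchanged. A minimal sketch:


#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative softmax: subtracting the maximum logit before
// exponentiation avoids overflow without changing the result.
template <typename T> std::vector<T> softmax(const std::vector<T>& x)
{
    T xmax = x[0];
    for (const T v : x) xmax = std::max(xmax, v);

    std::vector<T> out(x.size());
    T sum = T(0);
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        out[i] = std::exp(x[i] - xmax); // e^{x_i - max}
        sum += out[i];
    }
    for (T& v : out) v /= sum; // normalize so the outputs sum to 1
    return out;
}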

Role of Activation Functions

Activation functions are crucial for the success of FNNs because they introduce non-linearity into the network. This non-linearity allows the network to capture complex relationships within the data, enabling it to approximate highly intricate mappings between inputs and outputs. Furthermore, activation functions facilitate backpropagation by maintaining gradients that are necessary for efficient weight updates during training. Without them, FNNs would simply behave like linear models, drastically reducing their capacity to solve real-world problems.

Training

Training the network involves adjusting its weights and biases to minimize the error between its predicted output and the actual target values. This is typically done through supervised learning, where the network is provided with labeled input-output pairs.

During forward propagation, input data passes through the network, layer by layer, to produce an output. Each neuron processes the input by applying weights, biases, and an activation function. The network’s final output is then compared to the true target value to calculate the error or loss.

The loss function quantifies the difference between the predicted output and the actual target. Commonly used loss functions include Mean Squared Error (MSE) for regression tasks and Cross-Entropy Loss for classification tasks. The objective during training is to minimize this loss by updating the network’s parameters. For example, the MSE loss is calculated as:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2

where y_i is the true label, and \hat{y_i} is the predicted output.
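
As a concrete illustration, the MSE formula translates directly into code (an illustrative helper, not part of the library presented later):


#include <cstddef>
#include <vector>

// Direct translation of the MSE formula above; y holds the true labels,
// yhat the predictions.
template <typename T> T mse(const std::vector<T>& y, const std::vector<T>& yhat)
{
    T sum = T(0);
    for (std::size_t i = 0; i < y.size(); ++i)
    {
        const T d = y[i] - yhat[i];
        sum += d * d; // (y_i - yhat_i)^2
    }
    return sum / static_cast<T>(y.size()); // average over the n samples
}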

The process of backpropagation calculates the gradient of the loss function with respect to each weight in the network by applying the chain rule of calculus. This allows the error to propagate backward from the output layer through the hidden layers. The computed gradient shows the direction in which the weights should be adjusted to reduce the loss.

The gradients obtained from backpropagation are used by an optimization algorithm to update the weights and biases. Stochastic Gradient Descent (SGD) is a commonly used optimization method, though other variants like Adam and RMSprop are often employed for faster and more stable convergence. The general weight update rule in gradient descent is given by:

w = w - \eta \cdot \nabla L

where w is the weight to be updated, \eta is the learning rate controlling the size of the weight adjustment, and \nabla L is the gradient of the loss function with respect to the weight.
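
In code, the update is just an element-wise operation over the parameters (an illustrative sketch):


#include <cstddef>
#include <vector>

// One gradient-descent step, w = w - eta * grad, applied element-wise
// to a flat parameter vector.
template <typename T>
void gradient_step(std::vector<T>& w, const std::vector<T>& grad, T eta)
{
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] -= eta * grad[i]; // move against the gradient
}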

Training happens over several epochs, with each epoch representing a full pass through the training data. The data is usually divided into smaller batches, and weight updates are made after processing each batch, allowing the model to improve incrementally. After each epoch, the network’s performance is evaluated, and training continues until the loss converges or the maximum number of epochs is reached.

The learning rate controls the size of the updates to the weights during training. If it is set too high, the model may not converge; if too low, training may be slow. Overfitting can occur if the model fits the training data too closely but fails to generalize to new data. Regularization techniques such as dropout and L2 regularization can mitigate overfitting. Another strategy, early stopping, halts training when performance on a validation set stops improving, helping to prevent overfitting.
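
Early stopping, for instance, reduces to tracking the best validation loss seen so far; in this sketch trainOneEpoch and validationLoss are hypothetical callbacks, not part of the library shown later:


#include <cstddef>
#include <functional>
#include <limits>

// Sketch of early stopping: training halts when the validation loss has not
// improved for `patience` consecutive epochs.
std::size_t trainWithEarlyStopping(const std::function<void()>& trainOneEpoch,
                                   const std::function<double()>& validationLoss,
                                   std::size_t maxEpochs, std::size_t patience)
{
    double best = std::numeric_limits<double>::max();
    std::size_t sinceBest = 0, epoch = 0;
    for (; epoch < maxEpochs; ++epoch)
    {
        trainOneEpoch();
        const double loss = validationLoss();
        if (loss < best) { best = loss; sinceBest = 0; } // new best: keep going
        else if (++sinceBest >= patience) break;         // no improvement: stop
    }
    return epoch; // number of epochs actually run
}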

Backpropagation and Gradient Descent

Backpropagation and gradient descent are the core mechanisms by which a feedforward neural network (FNN) learns during training. These methods work together to iteratively adjust the weights and biases of the network in order to minimize the error between the predicted outputs and the actual target values.

Backpropagation, short for “backward propagation of errors,” is the algorithm responsible for calculating how much each weight in the network contributes to the overall error. The process begins by performing forward propagation, where the input data is passed through the network layer by layer to generate a predicted output. Once the output is produced, the loss function compares this output to the actual target value and computes the error. Backpropagation then propagates this error backward through the network, starting from the output layer and moving toward the input layer.

The key principle behind backpropagation is the chain rule from calculus. This rule allows us to calculate the gradient of the loss function with respect to each weight by decomposing the error into components that are specific to each layer. The error signal at each neuron is used to calculate the gradient of the loss with respect to the weights and biases in that layer. These gradients are essential for guiding the adjustment of the network’s parameters during the learning process.
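
Written out for a network whose output layer has index N, with pre-activations z^l and activations a^l = f(z^l), these per-layer gradients take the standard form:

\delta^N = \nabla_a L \odot f'(z^N)

\delta^l = \left( (W^{l+1})^T \, \delta^{l+1} \right) \odot f'(z^l)

\frac{\partial L}{\partial b^l} = \delta^l, \qquad \frac{\partial L}{\partial W^l} = \delta^l \, (a^{l-1})^T

where \odot denotes the element-wise (Hadamard) product. These are exactly the quantities computed by the backpropagation code shown later, via the Hadamard and outer products in TrainSGD.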

Once the gradients are computed through backpropagation, gradient descent is applied to update the weights and biases in the direction that reduces the loss. Gradient descent is an optimization algorithm that adjusts the weights by taking small steps in the direction of the negative gradient of the loss function. The size of these steps is determined by a parameter called the learning rate. If the learning rate is too large, the steps might be too aggressive, causing the network to overshoot the optimal solution. If the learning rate is too small, the updates will be too slow, and training will take much longer to converge.

The general rule for updating a weight in gradient descent can be expressed as:

w = w - \eta \cdot \nabla L

where w represents the weight being updated, \eta is the learning rate, and \nabla L is the gradient of the loss function with respect to the weight. This update rule ensures that the weights are adjusted in a way that moves the network closer to minimizing the error.

The backpropagation and gradient descent process is repeated over many iterations, often across multiple epochs. During each epoch, the network processes the entire training dataset, computes the loss, and updates the weights based on the gradients. As training progresses, the weights and biases are fine-tuned to minimize the loss function, leading the network to produce more accurate predictions. The goal of this iterative process is to find the set of parameters that best maps the input data to the desired output.

One important consideration in backpropagation is the choice of the loss function, as different tasks require different loss functions to measure error effectively. For regression tasks, mean squared error is commonly used, while for classification tasks, cross-entropy loss is more appropriate. Additionally, in deep networks, vanishing or exploding gradients can hinder the backpropagation process, making it difficult to train the network effectively. To mitigate these issues, alternative activation functions like ReLU are often used, and advanced optimization techniques such as Adam or RMSprop can be applied to improve convergence.

Overall, backpropagation and gradient descent form the backbone of the learning process in FNNs, enabling the network to iteratively learn from data, reduce errors, and improve its predictions over time.

Genetic Algorithm

An alternative method for training feedforward neural networks (FNNs) is through the use of genetic algorithms (GAs), which are inspired by the principles of natural selection and evolution. Instead of using gradient-based optimization like backpropagation, genetic algorithms approach training by treating the network’s weights and biases as individuals in a population.

The process begins by initializing a population of networks with random parameters. Each network is evaluated based on a fitness function, which is typically related to the performance of the network on a given task (such as accuracy or minimizing loss). The best-performing networks are then selected to “reproduce” by combining their weights through crossover operations, and mutations are applied to introduce small random changes in their parameters.

This process of selection, crossover, and mutation continues for several generations, with each generation producing networks that, on average, perform better than the previous one. Genetic algorithms are particularly useful when the error surface is highly non-linear or contains multiple local minima, as they do not rely on gradients and can explore a broader search space to find optimal solutions.

In general, GAs tend to be more computationally expensive and slower to converge compared to traditional gradient-based methods.
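
As a compact illustration of the selection, crossover, and mutation cycle, independent of the concrete C++ classes presented later; Individual, fitness, crossover, and mutate are hypothetical placeholders:


#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// High-level sketch of one GA training loop: rank the population by fitness,
// keep the top performers, and rebuild the rest from them.
template <typename Individual>
void evolve(std::vector<Individual>& pop, std::size_t nGenerations, std::size_t nTop,
            double (*fitness)(const Individual&),
            Individual (*crossover)(const Individual&, const Individual&),
            void (*mutate)(Individual&))
{
    std::vector<double> f(pop.size());
    std::vector<std::size_t> rank(pop.size());
    for (std::size_t g = 0; g < nGenerations; ++g)
    {
        for (std::size_t i = 0; i < pop.size(); ++i) f[i] = fitness(pop[i]); // evaluate once
        std::iota(rank.begin(), rank.end(), 0);
        std::sort(rank.begin(), rank.end(),
                  [&](std::size_t a, std::size_t b) { return f[a] > f[b]; }); // best first
        // replace everything below the top performers with offspring
        for (std::size_t i = nTop; i < pop.size(); ++i)
        {
            pop[rank[i]] = crossover(pop[rank[i % nTop]], pop[rank[(i + 1) % nTop]]);
            mutate(pop[rank[i]]); // small random change to keep diversity
        }
    }
}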

C++ implementation

Base class

In my C++ implementation of feedforward neural networks (FNNs), I have created a base class ANN_MLP within the nn namespace. This class provides the fundamental structure needed to manage the essential aspects of the network, such as storing network parameters, handling serialization and deserialization, and keeping track of the network’s state during training.

The ANN_MLP class supports a variety of functionalities. It allows the user to print the network’s configuration, including its weights and biases, through dedicated methods like PrintNetworkInfo, PrintBiases, and PrintWeights. These functions help in debugging and analyzing the structure of the network at any given point in the training process.

To manage the training process, the class includes methods for setting and updating the number of training epochs. With SetEpochs, the user can define the total number of epochs for training, while the UpdateEpochs method can increment the epoch count as the training progresses. This keeps the training process flexible and allows for dynamic adjustments as needed.

Another key feature of the class is its ability to handle data persistence. The Serialize and Deserialize methods allow the network to be saved to and loaded from a file. This is crucial for storing the state of a trained model or resuming training from a previously saved state.

Additionally, ANN_MLP provides inline methods for querying various properties of the network, such as the population size (GetPopSize), the number of top performers (GetTopPerformersSize), and the current epoch count (GetEpochs). These methods are designed for quick and efficient access to the network’s attributes, enabling the user to monitor and manage the network’s state during training and testing phases.


namespace nn
{
    template <typename T> class ANN_MLP
    {
      public:
        void PrintNetworkInfo();
        void PrintBiases();
        void PrintWeights();

        void SetName(const std::string& s) { sName = s; };

        void SetEpochs(size_t n) { nEpochs = n; };

        void UpdateEpochs(size_t n = 1) { nEpochs += n; };

        void Serialize(const std::string& fname);
        void Deserialize(const std::string& fname);

        inline size_t GetPopSize() const { return nPopSize; }

        inline size_t GetTopPerformersSize() const { return nTop; }

        inline size_t GetEpochs() const { return nEpochs; }

      protected:
        // ...

        inline T GetRandomNormal() { return normal_distribution(generator); }

        inline T GetRandomUniformReal() { return static_cast<T>(uniform_real_distribution(generator)); }

        inline int GetRandomUniformInt() { return uniform_int_distribution(generator); }
        // ...
    };

} // namespace nn

Overall, this class serves as a foundation for more complex neural network architectures by encapsulating the essential features required to manage and train a feedforward neural network in C++.
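
A hypothetical usage sketch, assuming the repository's headers are included; the layer sizes and file name are arbitrary, and the constructor comes from the SGD-derived class shown later:


#include <vector>
// assumes the repository's headers declaring nn::ANN_MLP_SGD are included

int main()
{
    // 784 inputs, one hidden layer of 30 neurons, 10 outputs (arbitrary sizes)
    nn::ANN_MLP_SGD<double> net({784, 30, 10});
    net.SetName("mnist_mlp"); // hypothetical name
    net.SetEpochs(30);
    net.PrintNetworkInfo();

    net.Serialize("mnist_mlp.bin");   // save the current state to file
    net.Deserialize("mnist_mlp.bin"); // restore it later
    return 0;
}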

Activation functions

I have included both the sigmoid and tanh activation functions, along with their derivatives, to handle the non-linear transformations necessary during forward propagation and backpropagation. These activation functions are defined in both scalar and vectorized forms to ensure computational efficiency during matrix operations.

The ActFunc template functions serve as flexible wrappers that apply the chosen activation function to a matrix. This design supports both element-wise operations and vectorized transformations, depending on the function signature passed. By utilizing function pointers, the implementation can efficiently apply either the sigmoid or tanh function, making it adaptable for different activation needs.

For the sigmoid function, the implementation provides both forward and derivative versions, ensuring that the derivative can be used during backpropagation for updating the weights in stochastic gradient descent (SGD). The same structure is followed for the tanh function, with the derivative playing a key role in calculating gradients.


#define T_C(x) static_cast<T>(x)

namespace nn
{
    enum ACT : size_t { SIGMOID = 0, TANH };

    template <typename S> inline la::Matrix<S>& ActFunc(la::Matrix<S>& res, void (*funcptr)(S*, size_t))
    {
        funcptr(res.data().data(), res.size());
        return res;
    }

    template <typename S> inline la::Matrix<S>& ActFunc(la::Matrix<S>& res, S (*funcptr)(S))
    {
        std::vector<S>& v_ = res.data();
        for (size_t i = 0; i < res.size(); ++i) v_[i] = funcptr(v_[i]);
        return res;
    }

    template <typename S>
    inline la::Matrix<S>& ActFunc(la::Matrix<S>& res, const la::Matrix<S>& a, void (*funcptr)(S*, const S*, size_t))
    {
        assert(res.size() == a.size());
        funcptr(res.data().data(), a.data().data(), a.size());
        return res;
    }

    template <typename S> inline la::Matrix<S>& ActFunc(la::Matrix<S>& res, const la::Matrix<S>& a, S (*funcptr)(S))
    {
        assert(res.size() == a.size());
        for (size_t i = 0; i < res.size(); ++i) res.data()[i] = funcptr(a.data()[i]);
        return res;
    }

    //**********************
    //      sigmoid
    //**********************
    template <typename T> inline T sigmoid(T x)
    {
        return T_C(1) / (T_C(1) + std::exp(-x));
    }

    template <typename T> inline void sigmoid(T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) arr[i] = T_C(1) / (T_C(1) + std::exp(-arr[i]));
    }

    template <typename T> inline void sigmoid(T* res, const T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) res[i] = T_C(1) / (T_C(1) + std::exp(-arr[i]));
    }

    template <typename T> inline T dsigmoid(T x)
    {
        return sigmoid(x) * (T_C(1) - sigmoid(x));
    }

    template <typename T> inline void dsigmoid(T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) arr[i] = dsigmoid(arr[i]);
    }

    template <typename T> inline void dsigmoid(T* res, const T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) res[i] = dsigmoid(arr[i]);
    }

    //**********************
    //     tanh
    //**********************
    template <typename T> inline T tanh(T x)
    {
        return std::tanh(x);
    }

    template <typename T> inline void tanh(T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) arr[i] = tanh(arr[i]);
    }

    template <typename T> inline void tanh(T* res, const T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) res[i] = tanh(arr[i]);
    }

    template <typename T> inline T dtanh(T x)
    {
        T th = tanh(x);
        return T_C(1) - th * th;
    }

    template <typename T> inline void dtanh(T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) arr[i] = dtanh(arr[i]);
    }

    template <typename T> inline void dtanh(T* res, const T* arr, const size_t size)
    {
        for (size_t i = 0; i < size; ++i) res[i] = dtanh(arr[i]);
    }
}

This approach allows seamless integration of these activation functions into the network, facilitating forward propagation with non-linear transformations and ensuring accurate gradient computation during backpropagation, which is essential for effective training through SGD. The ability to handle these operations at both the individual element and vector levels enhances the overall performance of the neural network, especially when dealing with large datasets.
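
For example, the wrappers can be invoked as follows; the static_casts select the desired sigmoid overload, and the la::Matrix usage assumes the repository's linear-algebra types:


#include <cstddef>
// assumes the repository's headers declaring la::Matrix and nn::ActFunc

void sigmoid_example()
{
    la::Matrix<double> m(3, 1); // a 3x1 column vector
    m[0][0] = -1.0; m[1][0] = 0.0; m[2][0] = 1.0;

    // in-place, vectorized overload: void sigmoid(T*, size_t)
    nn::ActFunc(m, static_cast<void (*)(double*, std::size_t)>(&nn::sigmoid<double>));

    // element-wise overload writing into a separate result matrix: T sigmoid(T)
    la::Matrix<double> r(3, 1);
    nn::ActFunc(r, m, static_cast<double (*)(double)>(&nn::sigmoid<double>));
}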

SGD derived class

In my implementation, the ANN_MLP_SGD class extends the base ANN_MLP class to specifically handle training and testing using the stochastic gradient descent (SGD) method. While the base class provides the core functionality required for managing the structure, weights, biases, and serialization of the network, the ANN_MLP_SGD class focuses on implementing the logic for training the network using mini-batches of data and adjusting the network parameters based on the computed gradients.

The ANN_MLP_SGD class inherits all the essential properties from ANN_MLP, including the network’s size, layers, activation function, and population-specific parameters. It utilizes this setup to execute the iterative training process through SGD. The core methods in this derived class, TrainSGD and TestSGD, are responsible for the actual training and evaluation of the network, taking input data and corresponding reference outputs for supervision.


namespace nn
{
    template <typename T> class ANN_MLP_SGD : public ANN_MLP<T>
    {
        using ANN_MLP<T>::vBiases;
        using ANN_MLP<T>::vWeights;
        // using most of the base class

      public:
        ANN_MLP_SGD();
        ANN_MLP_SGD(std::vector<size_t> size, int seed = 7419, size_t activationFunction = SIGMOID);
        void TrainSGD(const std::vector<std::vector<T>>& data, const std::vector<std::vector<T>>& reference,
                      size_t epochs, size_t miniBatchSize, double eta);

        int TestSGD(const std::vector<std::vector<T>>& data, const std::vector<std::vector<T>>& reference);

      private:
    };
}

By utilizing the existing architecture of the base class, ANN_MLP_SGD implements the training loop and the logic required for updating the network’s weights and biases based on the mini-batch gradients. This derived class ensures that the neural network can efficiently learn from data by optimizing its parameters through repeated epochs of training, using the SGD approach to incrementally reduce the error.
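
A typical session, sketched under the assumption that the MNIST images have been loaded as 784-element vectors with one-hot encoded labels (the data-loading code is omitted):


#include <iostream>
#include <vector>

int main()
{
    // placeholders: in practice these hold MNIST images (784 values each)
    // and one-hot labels (10 values each)
    std::vector<std::vector<double>> trainData, trainLabels, testData, testLabels;

    nn::ANN_MLP_SGD<double> net({784, 30, 10}, /*seed*/ 7419, nn::SIGMOID);

    // 10 epochs, mini-batches of 10 samples, learning rate eta = 3.0
    net.TrainSGD(trainData, trainLabels, 10, 10, 3.0);

    const int correct = net.TestSGD(testData, testLabels);
    std::cout << "correct predictions: " << correct << '\n';
    return 0;
}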

Training

The TrainSGD method is the core function that implements the stochastic gradient descent (SGD) algorithm in the ANN_MLP_SGD class. This method takes as input the training data, corresponding reference outputs, the number of epochs, the mini-batch size, and the learning rate (eta). The SGD algorithm works by dividing the training data into mini-batches, which are then used to compute gradients and update the network’s weights and biases incrementally.

The training process begins by initializing several necessary components. A random device and a random number generator are used to shuffle the training data at the beginning of each epoch. Activation function pointers (pAct) and derivative pointers (pDAct) are set based on the activation function being used, which could be either sigmoid or tanh. This allows the network to apply the correct activation function and its derivative during both the forward and backward passes of the training process.

Memory is then allocated for storing various matrices that will be used in the gradient descent process, such as activations (na_), weighted sums before activation (nzv_), gradients of activations (dno_), and the gradient updates for weights and biases (nw_, nb_).

For each epoch, the data is shuffled to ensure that mini-batches are sampled in a randomized order. For each mini-batch, the gradients for weights and biases are zeroed before processing the batch. Each data sample in the mini-batch is fed through the network, performing forward propagation to compute the activations for each layer.


for (size_t j = 0; j < s_.size() - miniBatchSize; j += miniBatchSize)
{
    // process a mini-batch of data
    it_ = s_.begin() + (long)j;
    // zero the accumulated gradients for the mini-batch
    for (size_t l = 0; l < nLayers - 1; ++l)
    {
        nb_[l].Zeros();
        nw_[l].Zeros();
        // the weights are not updated within the mini-batch,
        // so the transposes can be computed once per mini-batch
        la::MatTranspose(wt_[l], vWeights[0][l]);
    }

    for (size_t k = 0; k < miniBatchSize; ++k)
    {
        for (size_t l = 0; l < nLayers - 1; ++l)
        {
            nzv_[l].Zeros();
            dnb_[l].Zeros();
            dnw_[l].Zeros();
        }

        // first activation layer - input data
        na_[0].assign(data[*it_]);
        // forward propagation: z = W a + b, a = f(z), layer by layer
        for (size_t l = 1; l < nLayers; ++l)
        {
            la::MatMultVec(nzv_[l - 1], vWeights[0][l - 1], na_[l - 1]);
            nzv_[l - 1] += vBiases[0][l - 1];
            nn::ActFunc(na_[l], nzv_[l - 1], pAct);
        }
    }
}

The backpropagation process in the TrainSGD method plays a crucial role in calculating the gradients for weight and bias updates. Once forward propagation has been completed for a mini-batch, backpropagation begins by calculating the error at the output layer and then propagating this error backward through the hidden layers.

The first step in backpropagation is to compute the error at the output layer. In my implementation, this is done by subtracting the reference (target) values from the activations of the output layer (na_). This error is stored in the dno2_ matrix. The next step is to apply the derivative of the activation function to the pre-activation values (nzv_) at the output layer using the pDAct function pointer. This derivative is stored in nzv_ and is then used to compute the element-wise product (Hadamard product) of dno_ and dno2_, resulting in the delta for the output layer. This delta is stored in dnb_, which represents the gradient of the bias for the output layer, and dnw_, which represents the gradient of the weights. The outer product of the delta and the activations from the previous layer is computed using MatOuter to form the gradient for the weights.

After calculating the gradients for the output layer, the algorithm moves backward through the hidden layers. For each hidden layer, the derivative of the activation function is applied to the pre-activation values (nzv_), and the delta is computed using the weights from the next layer. Specifically, the transpose of the weights is multiplied by the delta from the subsequent layer using MatMultVec, and the Hadamard product is applied to compute the new delta for the current layer. This delta is then used to compute the gradients for both the biases (dnb_) and the weights (dnw_) for the current layer, following the same process as in the output layer.

Once the deltas and gradients are computed for all layers, the accumulated gradients for the mini-batch are summed up. The final step in backpropagation is the update step, where the computed gradients are applied to the weights and biases. This is done by scaling the gradients by the learning rate (eta) and the mini-batch size to ensure proper adjustments. The scaled gradients are subtracted from the current weights and biases, effectively updating the network’s parameters to minimize the error.


{
    // compute delta from the output layer
    const size_t osize = dno_.size() - 1;
    dno2_[osize] = na_[osize + 1];
    dno2_[osize] -= reference[*it_];
    nn::ActFunc(nzv_[osize], pDAct);
    MatHadamard(dno_[osize], dno2_[osize], nzv_[osize]);
    dnb_[osize] = dno_[osize];
    MatOuter(dnw_[osize], dno_[osize], na_[osize]);
}

for (size_t l = nLayers - 2; l > 0; --l)
{
    nn::ActFunc(nzv_[l - 1], pDAct);
    // the transpose is stored at the beginning of the minibatch
    la::MatMultVec(dno2_[l - 1], wt_[l], dno_[l]);
    MatHadamard(dno_[l - 1], dno2_[l - 1], nzv_[l - 1]);
    dnb_[l - 1] = dno_[l - 1];
    MatOuter(dnw_[l - 1], dno_[l - 1], na_[l - 1]);
}

// update nabla
for (size_t l = 0; l < (nLayers - 1); ++l)
{
    nw_[l] += dnw_[l];
    nb_[l] += dnb_[l];
}

// update to the next sample
it_++;

// update bias and weights
for (size_t l = 0; l < (nLayers - 1); ++l)
{
    nw_[l] *= (eta / static_cast<double>(miniBatchSize));
    vWeights[0][l] -= nw_[l];
    nb_[l] *= (eta / static_cast<double>(miniBatchSize));
    vBiases[0][l] -= nb_[l];
}

This process of backpropagation ensures that the gradients are propagated backward through the network, allowing each layer to contribute to the overall optimization of the network’s weights and biases during the training process.

Testing

The TestSGD function of the ANN_MLP_SGD class provides a straightforward implementation for evaluating the performance of the trained neural network. This method takes two parameters: the test data and the corresponding reference outputs. It computes how well the network performs on the test data by running it through the network and comparing the predicted output with the reference output to count the number of correct predictions.

The function begins by selecting the appropriate activation function based on the network’s configuration (either sigmoid or tanh). Next, a counter (iCorrect) is initialized to keep track of how many predictions are correct. A vector of matrices (na_) is created to store the activations at each layer as the data passes through the network.

For each input sample in the test dataset, the activations for the input layer (na_[0]) are set to the corresponding input data. The function then performs forward propagation by multiplying the activations from the previous layer by the weights of the current layer and adding the biases. After each multiplication and addition, the activation function is applied to the resulting values. This process continues layer by layer until the network produces an output.

Once the network produces its output, the function identifies the position of the maximum element in the output layer, which corresponds to the predicted class. This is done using std::max_element. Similarly, the position of the maximum element in the reference output is determined. If the predicted position matches the reference position, the iCorrect counter is incremented.


template <typename T>
int nn::ANN_MLP_SGD<T>::TestSGD(const std::vector<std::vector<T>>& data, const std::vector<std::vector<T>>& reference)
{
    void (*pAct)(T*, size_t) = nullptr;

    switch (act)
    {
    case SIGMOID: pAct = &sigmoid<T>; break;
    case TANH: pAct = &tanh<T>; break;
    default: throw std::invalid_argument("Unknown activation function");
    }

    int iCorrect = 0;
    std::vector<la::Matrix<T>> na_;
    for (size_t i = 0; i < nLayers; ++i) { na_.emplace_back(la::Matrix<T>(vSize[i], 1)); }

    for (size_t i = 0; i < data.size(); ++i) // iterate over every test sample
    {
        for (size_t l = 0; l < na_[0].GetRowsNb(); ++l) na_[0][l][0] = data[i][l];

        for (size_t j = 1; j < nLayers; ++j)
        {
            la::MatMultVec(na_[j], vWeights[0][j - 1], na_[j - 1]);
            na_[j] += vBiases[0][j - 1];
            nn::ActFunc(na_[j], pAct);
        }
        // find the max element
        typename std::vector<T>& res = na_[nLayers - 1].data();
        const typename std::vector<T>::iterator it_ = std::max_element(res.begin(), res.end());
        const ptrdiff_t maxPos = std::distance(res.begin(), it_);
        const ptrdiff_t refPos = std::distance(reference[i].begin(), std::max_element(reference[i].begin(), reference[i].end()));
        if (maxPos == refPos) iCorrect++;
    }

    return iCorrect;
}

After processing all the test samples, the function returns the total number of correct predictions. This count provides a measure of the network’s accuracy on the test data.

This testing method allows for a simple evaluation of the network’s performance by comparing its predictions against known reference values and counting how many predictions are accurate.

GA derived class

In my C++ implementation of feedforward neural networks (FNNs) using genetic algorithms (GA), the ANN_MLP_GA class is derived from the base ANN_MLP class. While the base class manages the core network structure, weights, biases, and activation functions, the ANN_MLP_GA class adds the necessary components for handling the population of networks, which is essential for GA-based optimization.

Unlike the traditional stochastic gradient descent (SGD) method, where a single network is trained by iteratively updating weights and biases, genetic algorithms work by evolving a population of networks. Each member of the population has its own set of weights and biases, which are evolved over multiple generations to optimize performance.

One of the critical additions in the ANN_MLP_GA class is the management of this population. The class maintains vectors for the population’s weights (vWeightsPop) and biases (vBiasesPop). Since GA requires multiple copies of the network (each representing a different individual in the population), memory management becomes a key consideration. The class ensures that each population member has its own distinct set of weights and biases, which significantly increases memory requirements compared to SGD.

The class also implements methods like CreatePopulation and AllocatePopulation to initialize and manage the population. These methods can generate either fixed or mixed populations depending on the configuration, with an option to keep the previous population if necessary. Additionally, the UpdateWeightsAndBiases method is responsible for transferring the weights and biases from a population member to the main network, ensuring that the feedforward computations are performed using the correct parameters.

In the genetic algorithm approach, the TrainGA method iterates through multiple generations, selecting top-performing networks and applying genetic operators such as crossover and mutation to evolve the population. The TestGA method, similar to its counterpart in SGD, evaluates the performance of the population on test data to measure the network’s accuracy.


namespace nn
{
    template <typename T> class ANN_MLP_GA : public ANN_MLP<T>
    {
        using ANN_MLP<T>::vBiases;
        using ANN_MLP<T>::vWeights;
        // using most of the base class

      public:

        void UpdateWeightsAndBiases(const std::vector<size_t>& v_);

        void feedforward(const T* pInputs, size_t inputsSize, T* pOutputs, size_t outputsSize, size_t memberid,
                         bool singleReturn);
        void feedforward(const std::vector<T>& vInputs, std::vector<T>& vOutputs, size_t memberid, bool singleReturn);
        void TrainGA(const std::vector<std::vector<T>>& data, const std::vector<std::vector<T>>& reference,
                     size_t nGenerations, size_t BatchSize, bool shuffleTrainingData = true);
        int TestGA(const std::vector<std::vector<T>>& data, const std::vector<std::vector<T>>& reference);

        void SetMixed(bool b);
        bool GetMixed();

        void CreatePopulation(bool bKeepPrevious = true);

    };

} // namespace nn

Overall, the ANN_MLP_GA class leverages the structure of the base class while introducing population management and evolutionary operations necessary for training FNNs using genetic algorithms. This approach allows the network to explore a broader search space and optimize its parameters through evolutionary processes rather than gradient-based methods.
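
A hypothetical training session with the GA class; the constructor arguments are assumed to mirror those of the SGD class (layer sizes, seed, activation), which the declaration above does not show:


#include <iostream>
#include <vector>

int main()
{
    // placeholders: in practice these hold the training images and one-hot labels
    std::vector<std::vector<double>> trainData, trainLabels, testData, testLabels;

    // assumed constructor, mirroring the SGD-derived class
    nn::ANN_MLP_GA<double> net({784, 30, 10});
    net.SetMixed(true); // use the random "mixed" population creation

    // 100 generations, mini-batches of 50 samples, with reshuffling
    net.TrainGA(trainData, trainLabels, 100, 50, true);

    const int correct = net.TestGA(testData, testLabels);
    std::cout << "correct predictions: " << correct << '\n';
    return 0;
}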

Population creation

In my implementation of the genetic algorithm (GA) for feedforward neural networks (FNNs), the population is created in two distinct ways: a fixed linear combination and a random mixed combination. Both approaches allow flexibility in generating new individuals for the next generation, and there is the option to preserve the top-performing individuals from the current generation to ensure that the best solutions are retained.

The first method, CreatePopulationFixed, linearly combines the top-performing networks from the previous generation. If the option to keep the top performers is enabled (bKeepPrevious), the method begins by copying the top individuals’ weights and biases directly into the new population. Afterward, the remaining population members are created by averaging the weights and biases of two different top performers. This averaging ensures that the new individuals are combinations of the best-performing networks, which helps guide the search process towards promising regions of the solution space. Once the top performers have been combined, any remaining population members are randomly initialized, using slight variations (mutations) on the top performers’ weights and biases to maintain diversity in the population.

The second method, CreatePopulationMixed, randomly selects two different top-performing networks and combines their weights and biases. For each layer of the network, a random selection is made between the two parents to decide which parent’s weights to copy to the new individual. This allows for a more diverse combination of genetic material, as opposed to the fixed averaging approach. After the new individual’s weights and biases have been initialized, a mutation is applied to introduce some random changes, ensuring that the population explores new areas of the solution space. As with the fixed method, the top performers can be preserved in the new generation if bKeepPrevious is set to true.


template <typename T>
void nn::ANN_MLP_GA<T>::CreatePopulationFixed(bool bKeepPrevious)
{
    std::lock_guard<std::mutex> lock(mtx);
    size_t a_ = 0, b_ = 0;
    // linearly combine the population
    [&] {
        // keep the top of the previous generation
        if (bKeepPrevious)
            for (size_t i = 0; i < nTop; ++i)
            {
                for (size_t k = 0; k < nLayers - 1; ++k)
                {
                    vBiasesPop[i][k].assign(vBiases[i][k].data());
                    vWeightsPop[i][k].assign(vWeights[i][k].data());
                }
                ++a_;
                if (a_ == nPopSize) return;
            }
        for (size_t i = 0; i < nTop; ++i)
            for (size_t j = i + 1; j < nTop; ++j)
            {
                for (size_t k = 0; k < nLayers - 1; ++k)
                {
                    vBiasesPop[a_][k].Zeros();
                    vBiasesPop[a_][k] += vBiases[i][k];
                    vBiasesPop[a_][k] += vBiases[j][k];
                    vBiasesPop[a_][k] *= static_cast<T>(0.5);
                    vWeightsPop[a_][k].Zeros();
                    vWeightsPop[a_][k] += vWeights[i][k];
                    vWeightsPop[a_][k] += vWeights[j][k];
                    vWeightsPop[a_][k] *= static_cast<T>(0.5);
                }
                ++a_;
                if (a_ == nPopSize) return;
            }
    }();
    while (a_ < nPopSize)
    {
        // create random population
        for (size_t i = 0; i < nLayers - 1; ++i)
        {
            vBiasesPop[a_][i].Zeros();
            vBiasesPop[a_][i] += vBiases[b_][i];
            vBiasesPop[a_][i] *= static_cast<T>(0.9);
            vWeightsPop[a_][i].Zeros();
            vWeightsPop[a_][i] += vWeights[b_][i];
            vWeightsPop[a_][i] *= static_cast<T>(0.9);
            const size_t nRows = vSize[i + 1], nCols = vSize[i];
            for (size_t n = 0; n < nRows; ++n)
            {
                vBiasesPop[a_][i][n][0] += static_cast<T>(0.1) * GetRandomNormal();
                for (size_t m = 0; m < nCols; ++m)
                    vWeightsPop[a_][i][n][m] += static_cast<T>(0.1) * GetRandomNormal();
            }
        }
        // choose the next top performer for the following mutation
        b_ = (b_ + 1) % nTop;
        ++a_;
    }
}

template <typename T>
void nn::ANN_MLP_GA<T>::CreatePopulationMixed(bool bKeepPrevious)
{
    // randomly combine the population
    std::lock_guard<std::mutex> lock(mtx);
    size_t a_ = 0, b_ = 0, count_ = 0;
    if (bKeepPrevious)
        for (size_t i = 0; i < nTop; ++i)
        {
            for (size_t k = 0; k < nLayers - 1; ++k)
            {
                vBiasesPop[i][k].assign(vBiases[i][k].data());
                vWeightsPop[i][k].assign(vWeights[i][k].data());
            }
            count_++;
        }
    for (size_t i = count_; i < nPopSize; ++i)
    {
        // select two random individuals
        a_ = static_cast<size_t>(GetRandomUniformInt() % static_cast<int>(nTop));
        b_ = static_cast<size_t>(GetRandomUniformInt() % static_cast<int>(nTop));
        // ensure that they are different
        while (a_ == b_)
            b_ = static_cast<size_t>(GetRandomUniformInt() % static_cast<int>(nTop));
        // create a new individual randomly combining the two selected
        for (size_t j = 0; j < nLayers - 1; ++j)
        {
            // get a random number between 0 and 1 to decide which parent to choose
            const size_t p = (GetRandomUniformReal() < 0.5) ? a_ : b_;
            // copy the parent's weights
            vBiasesPop[i][j].assign(vBiases[p][j].data());
            vBiasesPop[i][j] *= static_cast<T>(0.9);
            vWeightsPop[i][j].assign(vWeights[p][j].data());
            vWeightsPop[i][j] *= static_cast<T>(0.9);
            // mutate the weights
            const size_t nRows = vSize[j + 1], nCols = vSize[j];
            for (size_t n = 0; n < nRows; ++n)
            {
                vBiasesPop[i][j][n][0] += static_cast<T>(0.1) * GetRandomNormal();
                for (size_t m = 0; m < nCols; ++m)
                    vWeightsPop[i][j][n][m] += static_cast<T>(0.1) * GetRandomNormal();
            }
        }
    }
}

Both methods ensure a balance between exploiting the best solutions found so far and exploring new potential solutions. The ability to choose between a fixed combination and a mixed combination gives more control over how conservative or exploratory the population evolution process will be. This flexibility is key in fine-tuning the performance of the genetic algorithm during training.

Training

In my implementation of genetic algorithm (GA) training for feedforward neural networks (FNNs), the TrainGA function is designed to handle tasks that are not easily solvable with stochastic gradient descent (SGD). Unlike SGD, which relies on repetitive updates based on a fixed output and known error gradients, GAs evolve a population of networks by evaluating their performance on specific tasks and selecting the top performers to create the next generation. This makes GAs particularly useful for tasks like playing games or optimizing scenarios where the output is not easily defined.

However, in an effort to compare the GA approach with SGD in a well-defined setting, I have adapted this training function for use with MNIST classification. The idea is to apply the GA algorithm to a similar mini-batch training concept as used in SGD. During each generation, the entire population is tested against the MNIST dataset in mini-batches. The fitness of each network in the population is determined based on its classification accuracy for a given mini-batch.

The TrainGA method starts by shuffling the training data if necessary and creating a new population for the current generation. Each member of the population is then tested on the mini-batch, with the feedforward process applied using the weights and biases of the individual. After processing the mini-batch, the fitness score for each network is updated based on how many correct classifications it made.


for (size_t i = 0; i < nGenerations; ++i)
{
    // fill the index vector once (0..N-1); refilling it every generation
    // would undo the shuffle below
    if (i == 0) std::iota(s_.begin(), s_.end(), 0);
    // reshuffle the training order every 10 generations
    if (shuffleTrainingData && !(i % 10))
    {
        std::cout << "Reshuffling training data\n";
        std::shuffle(s_.begin(), s_.end(), g_);
    }
    // create the population
    CreatePopulation();
    // reset the fitness vector
    std::fill(f_.begin(), f_.end(), 0);
    // all population is tested vs the reference
    for (size_t j = 0; j < std::min(s_.size(), BatchSize); ++j)
    {
        it_ = s_.begin() + (ptrdiff_t)j;
        // first activation layer - input data. Is never overwritten and therefore can be allocated only once
        na_[0].assign(data[*it_]);
        for (size_t k = 0; k < nPopSize; ++k)
        {
            // first activation layer - input data.
            vNaPop[k][0].assign(data[*it_]);
            // feedforward the network
            for (size_t l = 1; l < nLayers; ++l)
            {
                la::MatMultVec(vNaPop[k][l], vWeightsPop[k][l - 1], vNaPop[k][l - 1]);
                vNaPop[k][l] += vBiasesPop[k][l - 1];
                nn::ActFunc(vNaPop[k][l], pAct);
            }
            // compute the fitness (+1 if the result is correct)
            typename std::vector<T>& res = vNaPop[k][nLayers - 1].data();
            const typename std::vector<T>::iterator it2_ = std::max_element(res.begin(), res.end());
            const ptrdiff_t maxPos = std::distance(res.begin(), it2_);
            if (reference[*it_][static_cast<size_t>(maxPos)] == 1) f_[k]++; // label of the current (possibly shuffled) sample
        }

    }
    // sort in descending order and store the index
    std::iota(v_.begin(), v_.end(), 0);
    std::sort(v_.begin(), v_.end(), [&](size_t a, size_t b) { return f_[a] > f_[b]; });
    // update the weights
    {
        // vector length is fixed
        const size_t length_ = vBiases[0].size();
        for (size_t j = 0; j < nTop; ++j)
        {
            for (size_t k = 0; k < length_; ++k)
            {
                vBiases[j][k].assign(vBiasesPop[v_[j]][k].data());
                vWeights[j][k].assign(vWeightsPop[v_[j]][k].data());
            }
        }
    }
}

Once the entire mini-batch has been processed, the networks in the population are sorted based on their fitness scores. The top-performing networks are then used to update the primary weights and biases for the next generation. This ensures that the best networks continue to influence the population, while new members are generated through crossover and mutation.

Although the GA implementation can be applied to MNIST classification, the results are generally poor compared to SGD, as GAs are not well-suited for tasks like image recognition, where precise and consistent weight adjustments are needed. GAs tend to excel in environments where the problem space is highly non-linear, discontinuous, or difficult to differentiate, such as in strategy optimization or game playing. In contrast, tasks like MNIST classification benefit from the more direct and efficient error-correction approach of SGD.

Testing

The TestGA function in the ANN_MLP_GA class is similar to the TestSGD method. It runs the feedforward process for each input in the test dataset and compares the predicted output to the reference output to measure the performance of the genetic algorithm (GA)-trained network.

In this method, the test data is passed through the network layer by layer, just like in SGD testing. For each input sample, the activation values in the input layer (na_[0]) are initialized with the input data, and the network’s feedforward process begins. The activations are multiplied by the weights, and biases are added at each layer. The activation function (sigmoid or tanh, depending on the configuration) is then applied to the results.

Once the output layer produces its final activation values, the network’s prediction is determined by finding the index of the maximum value in the output vector, which corresponds to the predicted class. This predicted index is then compared to the index of the maximum value in the reference output (the correct class). If the predicted index matches the reference index, the iCorrect counter is incremented.

At the end of the testing loop, iCorrect returns the total number of correct predictions, providing an accuracy measure for how well the network, trained using GA, performs on the test data.

Overall, the TestGA function mirrors the testing logic of TestSGD but applies it to the GA-trained population, verifying the network’s predictions against the ground truth in the same manner.

Complete code

The complete code is part of my “MA libs” libraries and is available on GitHub. The repository contains detailed instructions on how to set up the environment and compile the libraries, and can be reached at the following link.
