I have already provided an overview of Large Language Models (LLMs) here. Now, I would like to focus on Generative Pre-trained Transformers (GPT), which have recently gained significant attention.
GPT models are a type of language model that combine generative capabilities, pre-training on large text corpora, and the Transformer architecture. They generate coherent and contextually relevant text based on an input prompt, making them useful in tasks like text completion, summarization, and dialogue systems.
The “pre-training” refers to the model’s initial phase of being trained on vast amounts of text, which allows it to learn language patterns, grammar, and context before being fine-tuned on specific tasks.
The Transformer component, specifically the decoder-only architecture, enables efficient text generation by attending to all positions in the input sequence simultaneously. This eliminates the limitations of recurrent networks and allows GPT to handle long-range dependencies effectively.
A trained language model generates text by leveraging patterns it has learned during training. Optionally, input text can be provided, which influences the model’s generated output. During training, the model is exposed to vast amounts of text, from which it learns to predict the next word in a sequence. Each prediction starts off inaccurate, and the error is calculated to update the model for improved accuracy in future predictions.
This process is repeated many times, allowing the model to encode its learned knowledge into several billion parameters. These parameters are then used to determine which token to generate with each run. Initially, the model begins with random parameters, but through training, it converges to values that result in better text generation.
The model operates within a context window, which is a fixed-size portion of the input text that the model can attend to when generating new tokens. The size of this context window limits how much prior text the model can consider when making predictions. If the input exceeds this window, earlier tokens are ignored. Larger context windows allow the model to handle longer sequences, providing more context for generating relevant and coherent outputs.
The user can optionally provide a prompt (a set of instructions or initial text), which comes into play at the start of the text generation process. This prompt sets the initial context within the model’s fixed-size context window, guiding the model’s predictions. The prompt helps frame the generated output, ensuring that the text continues coherently and is relevant to the instructions or context provided. As the model generates new tokens based on the prompt, if the combined input and generated text exceed the context window, the earliest tokens are dropped, and the model focuses on the most recent tokens.
When generating text, the process begins by converting each input word into a vector—a list of numbers that numerically represents the word. These vectors are then processed through a stack of transformer decoder layers, each layer containing billions of parameters that perform intricate calculations to predict the next word in the sequence.
After processing, the resulting vector is converted back into a word, completing the generation cycle. This sequence of converting words to vectors, processing them through the transformer layers, and converting the output vectors back to words enables the model to generate coherent and contextually appropriate text based on the input it receives.
As I also explained in my previous post, a model can be fine-tuned by updating its weights to improve performance on specific tasks. Fine-tuning adjusts the pre-trained model’s parameters to better align with the requirements of a particular dataset or application, allowing the model to generate more accurate and task-specific outputs.
I will now implement a sample GPT trained on the text of Dante Alighieri's Divina Commedia.
A tokenizer is a fundamental tool in natural language processing (NLP) that transforms raw text into a sequence of tokens. Tokens are the basic units of meaning—such as words, subwords, or characters—that algorithms and models use to process and understand text data. Tokenization serves several critical purposes:
Normalization: It helps standardize text by converting it into a consistent format, such as lowercasing all words or removing punctuation.
Segmentation: It breaks down complex text into manageable pieces, making it easier for models to analyze patterns and relationships.
Numerical Mapping: It converts tokens into numerical representations (usually integers), enabling their use as inputs in machine learning models.
Without tokenization, models would struggle to interpret the raw text, as they require numerical inputs to perform computations.
Now, we will build a tokenizer using PyTorch. The tokenizer will read text from a file, split it into tokens, map each token to a unique integer, convert the sequence of tokens into a PyTorch tensor, and save it.
Below is the Python code that accomplishes this:
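A minimal sketch of such a tokenizer is shown below; the whitespace splitting, the lowercasing, and the file names divina_commedia.txt and tokens.pt are illustrative assumptions, and saving the vocabulary alongside the tensor is a convenience for the later sketches.

```python
import torch

# Read the raw text (file name is an assumption).
with open("divina_commedia.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Normalize and segment: lowercase the text and split it on whitespace.
tokens = text.lower().split()

# Numerical mapping: assign a unique integer ID to each distinct token.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Convert the token sequence into a tensor of IDs and save it together with the vocabulary.
token_ids = torch.tensor([vocab[t] for t in tokens], dtype=torch.long)
torch.save({"token_ids": token_ids, "vocab": vocab}, "tokens.pt")

print(f"Vocabulary size: {len(vocab)}, total tokens: {len(token_ids)}")
```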
The output will be similar to:
A data loader in PyTorch is an abstraction that provides an efficient way to iterate through large datasets. It allows batching, shuffling, and parallel loading of data, which is especially useful during the training of models. The data loader wraps a dataset and returns an iterator over the data, enabling easy access to mini-batches during training and evaluation.
For effective training, datasets are typically split into two (or more) subsets: a training set used to fit the model's parameters, and an evaluation (validation) set used to check how well the model generalizes.
We will now build a basic data loader that splits the tokenized dataset into training and evaluation sets.
Below is the Python code for this section:
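A sketch along these lines, assuming the tokens.pt file from the tokenizer sketch, a 90/10 train/evaluation split, and a batch size of 16 (both values are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    """Serves fixed-length token windows for next-token prediction."""
    def __init__(self, token_ids, context_length):
        self.token_ids = token_ids
        self.context_length = context_length

    def __len__(self):
        # Each item is a window of context_length + 1 tokens (inputs plus shifted targets).
        return len(self.token_ids) - self.context_length

    def __getitem__(self, idx):
        return self.token_ids[idx : idx + self.context_length + 1]

data = torch.load("tokens.pt")
token_ids = data["token_ids"]

# Split the tokenized data into training and evaluation subsets.
split = int(0.9 * len(token_ids))
train_ds = TokenDataset(token_ids[:split], context_length=512)
eval_ds = TokenDataset(token_ids[split:], context_length=512)

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
eval_loader = DataLoader(eval_ds, batch_size=16, shuffle=False)

print(f"Training batches: {len(train_loader)}, evaluation batches: {len(eval_loader)}")
```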
The output will be similar to:
This data loader provides an efficient way to access the tokenized data in batches, ready to be fed into the GPT model for training and evaluation.
Now that we have created a dataloader, we can start training the model.
The parameters used are as follows:
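A sketch of such a configuration block; apart from the context length of 512 and the 150 training epochs mentioned in this post, the concrete values are illustrative assumptions:

```python
import torch

context_length = 512   # maximum number of tokens per input sequence
d_model = 256          # size of the token embeddings (illustrative)
batch_size = 16        # illustrative
num_epochs = 150       # the basic model was trained for 150 epochs
lr = 1e-3              # learning rate (illustrative)
device = "cuda" if torch.cuda.is_available() else "cpu"
```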
The context length refers to the maximum number of tokens that the model processes in one input sequence. In this case, the context length is set to 512, meaning that each training example will consist of up to 512 tokens. The model will learn to predict the next token based on these 512 tokens, helping it capture dependencies between words over a limited window.
In PyTorch, the context_length is passed to the dataset class, which prepares sequences of that length from the tokenized data. This ensures that when the dataloader retrieves batches, each sequence has the correct number of tokens, providing a fixed-length input for the model. Keeping the context length manageable is important because increasing it significantly increases memory usage and computation time.
This approach is widely used in transformer-based models like GPT because it allows the model to capture both short- and long-range dependencies within the token sequence, up to the limit of the context length.
Then we can initialize the model:
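For example (a sketch; MyGPT is the class shown after the description below, and vocab, d_model, and device come from the earlier sketches):

```python
vocab_size = len(vocab)
model = MyGPT(vocab_size, d_model).to(device)
```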
The variable d_model defines the size of the embeddings, which will represent each token in the sequence. The GPT model is initialized with the vocab_size (the number of tokens in the vocabulary) and d_model (the dimensionality of the token embeddings). The model is moved to the specified device.
In the MyGPT class, the constructor defines the layers. self.wte creates an embedding layer where each token from the vocabulary is mapped to a dense vector of size d_model. self.ln_f normalizes these embeddings to stabilize training, and self.fc_out is the final linear layer that maps the processed embeddings back to the size of the vocabulary to predict the next token.
The forward method performs the forward pass of the model. First, the input tokens are converted into embeddings using the wte layer. These embeddings are passed through the layer normalization (ln_f) and then fed to the output layer (fc_out) to produce logits. The shape of logits will be (batch_size, sequence_length, vocab_size).
If target labels are provided (i.e., during training), the model calculates the cross-entropy loss. Both logits and targets are reshaped to match the expected shape for the loss computation. The loss is computed using F.cross_entropy.
The method returns the predicted logits and the loss (if targets are provided). If no targets are given, only the logits are returned for inference.
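A minimal sketch of this basic model, consistent with the description above (details may differ from the original implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyGPT(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.ln_f = nn.LayerNorm(d_model)              # final layer normalization
        self.fc_out = nn.Linear(d_model, vocab_size)   # projection back to the vocabulary

    def forward(self, idx, targets=None):
        x = self.wte(idx)          # (batch_size, sequence_length, d_model)
        x = self.ln_f(x)
        logits = self.fc_out(x)    # (batch_size, sequence_length, vocab_size)

        if targets is None:
            return logits          # inference: only the logits are needed

        # Training: flatten batch and sequence dimensions for the cross-entropy loss.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return logits, loss
```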
The optimizer is initialized using Adam, a commonly used variant of gradient descent. The model.parameters() function provides all the parameters of the model that need to be updated, and lr is the learning rate that controls the step size during optimization.
The training loop runs for a fixed number of epochs (num_epochs). At the start of each epoch, the model is set to training mode with model.train(), and total_loss is initialized to zero to accumulate the loss over the entire epoch.
In each batch of the training loop, the input tokens (inputs) are selected from the sequence by excluding the last token. The targets are the shifted sequence, starting from the second token onward. This prepares the model to predict the next token in the sequence.
Before each update, optimizer.zero_grad() clears the previously accumulated gradients. Then, the model computes the logits and the loss using the forward pass. The loss.backward() function computes the gradients with respect to the model parameters, and optimizer.step() updates the parameters using those gradients.
The loss for the current batch is added to total_loss, which keeps track of the cumulative loss for the entire epoch.
At the end of the epoch, the average training loss (avg_tl) is calculated by dividing total_loss by the number of batches in the training data. The function print_time outputs the epoch number and the average training loss.
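Putting these steps together, the optimizer setup and training loop might look roughly like this (a sketch; the values come from the illustrative configuration above, and a plain print stands in for the print_time helper):

```python
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for batch in train_loader:
        batch = batch.to(device)
        inputs = batch[:, :-1]    # all tokens except the last one
        targets = batch[:, 1:]    # the same sequence shifted by one token

        optimizer.zero_grad()                  # clear previously accumulated gradients
        logits, loss = model(inputs, targets)  # forward pass and loss
        loss.backward()                        # compute gradients
        optimizer.step()                       # update the parameters

        total_loss += loss.item()

    avg_tl = total_loss / len(train_loader)
    # The original code uses a print_time helper; a plain print stands in here.
    print(f"Epoch {epoch + 1}: average training loss {avg_tl:.4f}")
```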
After training, the model switches to evaluation mode with model.eval(), which disables certain layers like dropout. The torch.no_grad() context is used to disable gradient calculation, improving memory usage and speed. Similar to training, the model is evaluated on the validation data, with inputs and targets prepared in the same way.
At the end of validation, the average evaluation loss (avg_evl) is calculated. The function print_time outputs the epoch number and the average evaluation loss for monitoring the model's performance.
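A matching sketch of this validation pass, written as the tail of the epoch loop above (hence the indentation):

```python
    # Validation pass, run at the end of each epoch inside the loop above.
    model.eval()
    eval_loss = 0.0
    with torch.no_grad():   # disable gradient tracking to save memory and time
        for batch in eval_loader:
            batch = batch.to(device)
            inputs, targets = batch[:, :-1], batch[:, 1:]
            _, loss = model(inputs, targets)
            eval_loss += loss.item()

    avg_evl = eval_loss / len(eval_loader)
    print(f"Epoch {epoch + 1}: average evaluation loss {avg_evl:.4f}")
```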
At the end of training, the model's configuration and weights are saved using torch.save(). The dictionary being saved contains the vocab_size, d_model, and the state_dict() of the model, which holds all the trained parameters (weights and biases). This allows the model to be easily reloaded later for inference or further training.
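A sketch of this saving step (the file name and the dictionary keys are assumptions):

```python
torch.save(
    {
        "vocab_size": vocab_size,
        "d_model": d_model,
        "state_dict": model.state_dict(),
    },
    "my_gpt.pt",   # file name is an assumption
)
```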
In conclusion, saving the model’s state (specifically the weights) ensures that it can be reused later without the need for retraining. While the training process can be very time-consuming, especially for large transformer models, decoding (or inference) is much faster, as it only involves passing the input through the model using the precomputed weights. This is a typical approach in transformer-based architectures like GPT. The model can be deployed and used for generating sequences quickly, making it highly efficient for real-time applications.
The model was trained for 150 epochs. Below is a snippet of the training progress:
As observed, the training loss barely changed between epochs 60 and 150, indicating that the model’s improvement plateaued. Given the simplicity of the model, this result is expected. The next step is to use this trained model for text prediction.
In this setup, text is generated from an initial input sequence. We begin by loading the trained GPT model, the tokenized vocabulary, and the other parameters required for text generation:
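A sketch of this loading step, reusing the illustrative file names and checkpoint keys from the earlier sketches (MyGPT is the model class defined above):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Rebuild the vocabulary and the reverse mapping from token IDs to tokens.
data = torch.load("tokens.pt")
vocab = data["vocab"]
id_to_token = {idx: token for token, idx in vocab.items()}

# Load the saved configuration and weights, then put the model into evaluation mode.
checkpoint = torch.load("my_gpt.pt", map_location=device)
model = MyGPT(checkpoint["vocab_size"], checkpoint["d_model"]).to(device)
model.load_state_dict(checkpoint["state_dict"])
model.eval()
```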
The tokenizer converts text into token IDs, and a reverse mapping (id_to_token) is created to map token IDs back to human-readable text. The trained model is then loaded and placed into evaluation mode.
The generate_text function accepts a starting sequence of text and a maximum number of new tokens to generate. First, the input text is encoded by converting the words into the corresponding token IDs from the vocabulary dictionary.
The tokenized input is then fed into the model to predict the next tokens. The model generates text iteratively, adding one token at a time to the existing sequence. The generate method handles this process:
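A condensed sketch of this generation logic is shown below. It folds the loop directly into generate_text rather than a separate generate method, reuses model, vocab, id_to_token, device, and the imports from the earlier sketches, and uses illustrative defaults for temperature and top_k:

```python
def generate_text(start_text, max_new_tokens, temperature=1.0, top_k=50):
    # Encode the starting text into token IDs (the words must appear in the training vocabulary).
    ids = [vocab[t] for t in start_text.lower().split()]
    tokens = torch.tensor(ids, dtype=torch.long, device=device).unsqueeze(0)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(tokens)                    # (1, sequence_length, vocab_size)
            logits = logits[:, -1, :] / temperature   # scores for the next token, scaled

            # Top-k sampling: keep only the k most likely candidates.
            top_values, top_indices = torch.topk(logits, top_k)
            probs = F.softmax(top_values, dim=-1)

            # Sample one token from the reduced distribution.
            next_id = top_indices.gather(-1, torch.multinomial(probs, num_samples=1))

            # Append the new token and continue from the extended sequence.
            tokens = torch.cat([tokens, next_id], dim=1)

    # Map the generated token IDs back to words.
    return " ".join(id_to_token[i] for i in tokens[0].tolist())

# Example invocation (the prompt here is a placeholder line from the poem):
print(generate_text("nel mezzo del cammin di nostra vita", max_new_tokens=50))
```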
The generation loop runs for max_new_tokens iterations. During each iteration, the model is provided with the current token sequence, and the logits (the raw output scores from the model) for the possible next tokens are calculated.
Temperature scaling is applied to adjust the randomness of the output. A higher temperature makes the model more exploratory, while a lower temperature favors more likely tokens.
Top-k sampling is then applied, which limits the possible next tokens to the most likely k candidates. This prevents the model from choosing highly unlikely tokens.
The logits are transformed into a probability distribution using the softmax function, and a token is sampled from this distribution.
Each newly generated token is appended to the existing sequence of tokens, and the process repeats for the desired number of tokens.
The output is a sequence of token IDs, which is then transformed back into words using the id_to_token dictionary.
Finally, the generate_text function is invoked and the generated text is printed out.
The final output is a concatenation of the original input sequence and the newly generated text. The generated text can vary significantly depending on the parameters such as temperature and top-k value, providing either creative or more deterministic results.
For demonstration purposes, the simple GPT model was used to generate text based on an initial prompt. The model was trained for 150 epochs, but as it is quite basic, the quality of the output is expected to be limited.
We start with a well-known line from Dante’s Divine Comedy:
The model then generates a continuation based on this input. Below is the generated output:
As we can observe, the generated text lacks coherence and fails to maintain a meaningful flow, which is expected given the simplicity of the model. Despite the initial structure resembling natural language, the generated continuation quickly becomes disjointed and nonsensical.
This result provides a baseline to which we can compare the improved model once the attention mechanism is added, allowing the model to better capture relationships between tokens in the input sequence.
While the basic model produces results, they are not yet impressive. However, several enhancements can significantly elevate the model’s quality. Even a homegrown GPT can achieve decent performance with the right adjustments.
The quality of the model output depends heavily on the input tokenization process. For this reason, I opted to rely on external libraries that are specifically designed for natural language processing. One such library is tiktoken, which is tailored for tokenizing text for models like GPT-3.
For the tokenizer, using a well-established external library is essential to ensure that the input is efficiently processed. In my case, I chose tiktoken as it is designed for natural language modeling and aligns well with the requirements of GPT models.
No changes were made to the data loader, as PyTorch already provides efficient and reliable data handling mechanisms. Its built-in utilities guarantee proper preprocessing, batching, and shuffling of data, which are essential for model training.
Adding positional encoding (PE) is necessary in transformer architectures, including GPT models, because the attention mechanism itself is “permutation invariant” — meaning it doesn’t inherently know the position of tokens in a sequence. The PE provides a way to inject position information into the model so that it can understand the order of tokens in the input.
The neural network layer, usually a feedforward network applied after the attention mechanism, helps process the information further and improves the model’s ability to capture complex relationships. It’s part of the standard transformer block and helps refine the representations learned by the attention layer.
We will add a sinusoidal positional encoding, commonly used in transformer models.
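A sketch of a standard sinusoidal positional encoding module (the default maximum sequence length is an assumption):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to the token embeddings."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even embedding dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd embedding dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x):
        # x has shape (batch_size, sequence_length, d_model).
        return x + self.pe[:, : x.size(1)]
```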
A neural network will act as the post-attention transformation, but for now, we’ll include it before attention. It’s a simple two-layer feedforward network with activation.
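A sketch of such a feedforward block (the expansion factor of 4, the ReLU activation, and the dropout rate are assumptions):

```python
class FeedForward(nn.Module):
    """Two-layer feedforward network: expansion, non-linearity, compression."""
    def __init__(self, d_model, hidden_dim=None, dropout=0.1):
        super().__init__()
        hidden_dim = hidden_dim or 4 * d_model   # expansion factor of 4 is an assumption
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```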
These two layers were integrated into the model, and a few runs were made to ensure that the overall model is still properly working before adding the attention mechanism.
We can now add a single-head attention layer. First, the attention layer is created:
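A sketch consistent with the description that follows (it reuses the imports from the earlier sketches):

```python
class SingleHeadAttention(nn.Module):
    """Single-headed causal self-attention."""
    def __init__(self, d_model):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5   # 1/sqrt(d_model), stabilizes gradients for large d_model

    def forward(self, x):
        # x has shape (batch_size, sequence_length, d_model).
        B, T, _ = x.shape
        Q, K, V = self.query(x), self.key(x), self.value(x)

        # Similarity between every pair of tokens, scaled.
        scores = Q @ K.transpose(-2, -1) * self.scale                      # (B, T, T)

        # Causal mask: each position may only attend to itself and earlier tokens.
        mask = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
        scores = scores.masked_fill(~mask, float("-inf"))

        weights = F.softmax(scores, dim=-1)   # attention probabilities
        return weights @ V                    # weighted sum of the values
```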
This is the SingleHeadAttention class, which implements a single-headed self-attention mechanism. In this setup, three linear layers are used to project the input into query, key, and value vectors. These vectors are crucial for computing the attention scores. A scaling factor is applied to the dot product of the query and key to stabilize the gradients, especially for large values of d_model.
In the forward pass, the input embeddings x are passed through the query, key, and value linear layers to generate the Q, K, and V matrices. These projections are used to compute attention scores and derive the final attention output.
The attention scores are calculated as the dot product between the query and the transposed key vectors, scaled by the scale factor. This operation computes the similarity between each pair of tokens in the sequence.
Next, a causal mask is applied. This mask prevents the model from attending to future tokens (i.e., tokens to the right of the current one in the sequence), which is critical for autoregressive models like GPT. The mask is applied by setting the attention scores of future tokens to -inf, effectively blocking them from consideration in the attention mechanism.
The attention scores are converted into probabilities using the softmax function, which normalizes the scores along the sequence length dimension. These normalized weights indicate how much attention each token should pay to other tokens in the sequence.
Finally, the attention weights are used to compute the weighted sum of the values (V), which produces the final attention output. This output is the result of the model attending to relevant parts of the input sequence.
This layer is then added to the model:
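A sketch of how the pieces might fit together (the exact placement of the positional encoding and normalization may differ from the original):

```python
class MyGPT(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)
        self.pos = PositionalEncoding(d_model)       # from the earlier sketch
        self.attn = SingleHeadAttention(d_model)
        self.ffn = FeedForward(d_model)
        self.ln_f = nn.LayerNorm(d_model)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, idx, targets=None):
        x = self.pos(self.wte(idx))    # token embeddings plus positional information
        x = self.attn(x)               # causal self-attention
        x = self.ffn(x)                # feedforward transformation
        logits = self.fc_out(self.ln_f(x))

        if targets is None:
            return logits
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return logits, loss
```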
In the MyGPT class, we instantiate the SingleHeadAttention layer, along with a feedforward neural network (FFN), layer normalization, and the final output layer. The FFN helps the model capture complex patterns by transforming the attention output before generating the final token predictions.
In the forward method, we first pass the input embeddings through the attention layer to compute the attention output. This output is then processed through the feedforward neural network (FFN), which transforms the representation before passing it to the next layers in the model.
Finally, the output logits are computed, and the loss is optionally calculated if target labels are provided during training.
This complete attention implementation allows the GPT model to attend to relevant parts of the input sequence when making predictions, making the model more powerful and capable of generating coherent text.
The final step is to add multiple layers of attention:
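A sketch of such a multi-head attention layer, consistent with the description that follows:

```python
class MultiHeadAttention(nn.Module):
    """Causal self-attention with several heads attending to different subspaces."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)   # final projection after concatenation

    def forward(self, x):
        B, T, d_model = x.shape

        def split_heads(t):
            # (B, T, d_model) -> (B, num_heads, T, head_dim)
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        Q = split_heads(self.query(x))
        K = split_heads(self.key(x))
        V = split_heads(self.value(x))

        # Scaled dot-product attention per head.
        scores = Q @ K.transpose(-2, -1) / (self.head_dim ** 0.5)   # (B, num_heads, T, T)

        # Causal mask: block attention to future positions.
        mask = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
        scores = scores.masked_fill(~mask, float("-inf"))

        weights = F.softmax(scores, dim=-1)
        context = weights @ V                                        # (B, num_heads, T, head_dim)

        # Concatenate the heads back to (B, T, d_model) and project.
        context = context.transpose(1, 2).contiguous().view(B, T, d_model)
        return self.out(context)
```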
In this implementation of multi-head attention, the attention mechanism is expanded to use multiple attention heads, allowing the model to focus on different parts of the input sequence simultaneously. This contrasts with the single-head attention, where only one attention head computes a single set of attention scores.
The __init__ method starts by ensuring that the input dimensionality (d_model) is divisible by the number of attention heads (num_heads). Each head will attend to different subspaces of the input. The dimension of each head (head_dim) is calculated as d_model // num_heads, so that when the heads are concatenated, the final output still has dimensionality d_model.
In the forward method, the input tensor x is first projected into query (Q), key (K), and value (V) vectors using three separate linear layers. After this projection, each of the Q, K, and V matrices is reshaped to have multiple heads. This reshaping splits the original tensor into multiple heads of size head_dim. The resulting shapes are (batch_size, num_heads, sequence_length, head_dim), meaning each head processes a part of the input independently.
The attention scores are computed similarly to single-head attention by taking the dot product of the query (Q) and the transpose of the key (K), scaled by the inverse square root of head_dim to ensure stability. After computing the attention scores, a causal mask is applied to prevent attending to future tokens, which is essential for autoregressive models like GPT.
After masking, the attention scores are normalized using the softmax function to compute the attention weights, which represent the importance of each token relative to others in the sequence. The attention weights are then used to compute the weighted sum of the value (V) vectors, producing the final attention output for each head.
At this stage, the outputs of all the attention heads are concatenated back into a single tensor. This is done by transposing the tensor and reshaping it so that the heads are combined, restoring the original dimensionality of d_model. The final concatenated output is passed through a final linear layer, which projects the result back into the same dimensional space as the input (d_model), ensuring consistency for further processing.
The primary difference from single-head attention is that multi-head attention allows the model to compute attention over multiple subspaces of the input at once, enabling it to capture a richer set of relationships between tokens. This makes multi-head attention more powerful, as each head can focus on different parts of the input, leading to more nuanced and diverse attention outputs. In contrast, single-head attention computes a single set of attention scores and outputs, which may be less expressive in capturing complex dependencies in the data.
The changes to the model are minimal: you can seamlessly switch between single-head and multi-head attention without modifying the rest of the code, which makes it easy to experiment with different configurations of attention layers.
Stacking multiple GPT blocks improves the model’s ability to understand complex patterns in sequences. Each block adds a new layer of representation to the input data, which enhances the ability to capture long-range dependencies. As more blocks are stacked, the model refines the contextual understanding of the tokens, leading to better generalization and performance. The attention layers can focus on different aspects of the input at each block, while the feedforward network further transforms the data, allowing the model to progressively build more abstract features.
With multiple blocks stacked in this way, the token representations are refined through successive layers of attention and feedforward networks; a sketch of a single block and of the stacking follows the description below. Now, let's look at the internal structure of a single block in detail.
The single block consists of two main components: attention and a feedforward neural network. The attention layer is responsible for capturing relationships between tokens, allowing the model to focus on different parts of the input sequence simultaneously. This is achieved either with a single attention head or multiple heads, depending on the configuration. Using multiple heads enables the model to focus on different aspects of the sequence concurrently.
After the attention mechanism, the output is normalized using a LayerNorm to stabilize the training process and ensure better convergence. The model then passes the data through a feedforward neural network, which consists of an expansion layer followed by a compression layer. The expansion increases the dimensionality of the input, while the compression brings it back to the original size. This allows for more complex transformations of the data, leading to a richer representation.
Layer normalization is applied once more after the feedforward network, along with a residual connection to help retain the original input, which is crucial for the flow of gradients during backpropagation. Finally, dropout is applied to reduce overfitting by randomly zeroing out some of the activations during training.
Each block, therefore, progressively refines the input representation by learning deeper and more abstract patterns in the data. When stacked, these blocks form a powerful model capable of capturing long-range dependencies and complex relationships between tokens.
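A sketch of a single block and of how several blocks might be stacked inside the model (the number of heads, number of layers, dropout rate, and exact residual placement are assumptions; the attention, feedforward, and positional encoding classes come from the earlier sketches):

```python
class GPTBlock(nn.Module):
    """One transformer block: attention, normalization, feedforward network, residuals, dropout."""
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, dropout=dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sub-layer with residual connection and normalization.
        x = self.ln1(x + self.attn(x))
        # Feedforward sub-layer with residual connection, normalization, and dropout.
        x = self.ln2(x + self.ffn(x))
        return self.dropout(x)


class MyGPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads=4, n_layers=4, dropout=0.1):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)
        self.pos = PositionalEncoding(d_model)
        self.blocks = nn.ModuleList([GPTBlock(d_model, num_heads, dropout) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, idx, targets=None):
        x = self.pos(self.wte(idx))
        for block in self.blocks:     # each block refines the representation further
            x = block(x)
        logits = self.fc_out(self.ln_f(x))
        if targets is None:
            return logits
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return logits, loss
```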
The optimizer is initialized using Adam, a commonly used variant of gradient descent. The model.parameters() function provides all the parameters of the model that need to be updated, and lr is the learning rate. The training routine remains the same as for the basic model, involving forward and backward passes for each batch, updating the parameters with the optimizer, and printing the average loss at the end of each epoch.
The process for running the model is unchanged from the basic model case. First, the trained model and vocabulary are loaded, and the input text is tokenized into token IDs. The tokenized sequence is then passed through the model, which predicts the next tokens iteratively. During each iteration, the model generates tokens using temperature scaling and top-k sampling to control randomness and ensure coherent text generation. Once the desired number of tokens is generated, the token IDs are mapped back to text, forming the final generated sequence.
For the advanced GPT model, I used the same initial prompt as with the basic model, but after training the model for 500 epochs. The generated text shows improvement compared to the basic model in terms of phrase structure and flow, though it is still far from perfect. The phrases now exhibit a more natural progression, and there is a sense of continuity between sentences.
The same prompt was used:
The advanced model generated the following text:
The output now shows a better flow of phrases compared to the basic model. Although it still lacks complete coherence and semantic clarity, the generated text is structured in a way that mimics natural speech. The improvement in the phrases’ fluidity reflects the benefits of using a more advanced model.
This concludes the creation and demonstration of the GPT model, showcasing the improvements in text generation through enhanced training and architecture.
To better understand the structure and configuration of the model, I implemented a model introspection function. This function outputs relevant parameters such as the number of layers, dimensions of embeddings, and whether multi-head attention is being used. The introspection also calculates and displays the total number of trainable parameters.
The function first moves the model to the specified device (CPU or GPU) and optionally compiles the model using torch.compile. It then prints the full model architecture, including details of the embedding layer, positional encoding, stacked GPT blocks, and output layers. Additionally, the configuration of the model, such as the vocabulary size, embedding dimensions, hidden dimensions, dropout probability, and number of attention heads, is displayed.
Finally, the total number of trainable parameters is calculated and printed, giving insight into the overall complexity of the model.
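A simplified sketch of such an introspection helper (the original also prints the individual configuration values; the function and argument names here are assumptions):

```python
def inspect_model(model, device, compile_model=False):
    """Prints the model architecture and the number of trainable parameters."""
    model = model.to(device)
    if compile_model:
        model = torch.compile(model)   # optional graph compilation (PyTorch 2.x)

    # Printing an nn.Module shows the full layer hierarchy.
    print(model)

    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {total_params:,}")
    return model
```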
Example output:
Once one or more training runs have been completed, it is possible to visualize the training and validation loss using either a simple or exponential moving average to smooth the data. This provides a clearer view of the trend in the loss values by reducing noise.
Below is an example of how you can visualize the smoothed training and validation loss:
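A sketch of such a plotting helper using matplotlib; apart from sma_window_size and ema_alpha, the function and argument names are assumptions:

```python
import matplotlib.pyplot as plt

def smooth_sma(values, window):
    """Simple moving average over a trailing window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def smooth_ema(values, alpha):
    """Exponential moving average with smoothing factor alpha."""
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def plot_losses(train_losses, val_losses=None, use_ema=True, sma_window_size=10, ema_alpha=0.1):
    smooth = (lambda v: smooth_ema(v, ema_alpha)) if use_ema else (lambda v: smooth_sma(v, sma_window_size))

    plt.plot(train_losses, alpha=0.3, label="training loss")
    plt.plot(smooth(train_losses), label="training loss (smoothed)")
    if val_losses is not None:
        plt.plot(val_losses, alpha=0.3, label="validation loss")
        plt.plot(smooth(val_losses), label="validation loss (smoothed)")

    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```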
This function visualizes the loss values and their smoothed versions using either a simple or an exponential moving average. You can toggle between training and validation loss and control the smoothing effect by modifying the sma_window_size for the simple moving average or the ema_alpha for the exponential moving average.
The following plot illustrates the results of a 500-epoch run:
The complete code for the model implementation, training, and visualization is available on GitHub. The repository contains detailed instructions on how to set up the environment, train the model, generate text, and visualize the training and validation losses.
The repository is available at the following link.