Building a GPT from scratch: multiple blocks and training
I’ve focused on enhancing my GPT model by stacking multiple attention blocks, a change that noticeably improves its ability to capture complex patterns in sequences. Each block pairs an attention mechanism with a feedforward neural network, so the model progressively builds a richer picture of the contextual relationships between tokens.
Building on the basic structure of a Generative Pre-trained Transformer (GPT), the idea is to refine the token representations through several stacked layers of attention and feedforward networks.
The core of the enhanced model is the MyGPT class, where I stack multiple instances of MyGPTBlock. Each block captures different aspects of the input through its attention mechanism, which can be either single-head or multi-head depending on the configuration. Here’s a brief look at the model architecture in Python:
import torch
import torch.nn as nn


class MyGPT(nn.Module):
    def __init__(self, vocab_size, d_model, max_len, hidden_dim, dropout_prob, n_layers, num_heads, use_multiple_head):
        super(MyGPT, self).__init__()
        # stack of n_layers identical transformer blocks
        self.blocks = nn.ModuleList([
            MyGPTBlock(d_model, hidden_dim, dropout_prob, num_heads, use_multiple_head)
            for _ in range(n_layers)
        ])
        # final layer norm and projection from d_model back to vocabulary logits
        self.ln_f = nn.LayerNorm(d_model)
        self.fc_out = nn.Linear(d_model, vocab_size)
class MyGPTBlock(nn.Module):
    def __init__(self, d_model, hidden_dim, dropout_prob, num_heads, use_multiple_head):
        super(MyGPTBlock, self).__init__()
        # pick the attention variant; both implementations are defined elsewhere in the project
        if use_multiple_head:
            self.attention = MultiHeadAttention(d_model, num_heads)
        else:
            self.attention = SingleHeadAttention(d_model)
        # position-wise feedforward network with a GELU non-linearity
        self.ffn = nn.Sequential(nn.Linear(d_model, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, d_model))
        # layer norms and dropout applied around the two sub-layers
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout_prob)
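The listing above only shows the constructors. A minimal sketch of how the forward passes could be wired, assuming pre-norm residual connections and token/positional embeddings in MyGPT (the attribute names tok_emb and pos_emb, and the exact layout, are my own illustration rather than the original implementation):

# Sketch only: these forward methods would sit inside the classes defined above.
# They assume MyGPT.__init__ also creates token/positional embeddings
# (self.tok_emb, self.pos_emb) and a dropout layer -- illustrative names.

class MyGPTBlock(nn.Module):
    # ... __init__ as shown above ...

    def forward(self, x):
        # residual connection around the (causally masked) attention sub-layer
        x = x + self.dropout(self.attention(self.ln1(x)))
        # residual connection around the position-wise feedforward sub-layer
        x = x + self.dropout(self.ffn(self.ln2(x)))
        return x


class MyGPT(nn.Module):
    # ... __init__ as shown above, plus tok_emb, pos_emb and dropout ...

    def forward(self, idx):
        # idx: (batch, seq_len) tensor of token ids
        seq_len = idx.size(1)
        positions = torch.arange(seq_len, device=idx.device)
        x = self.dropout(self.tok_emb(idx) + self.pos_emb(positions))
        for block in self.blocks:   # refine the representations layer by layer
            x = block(x)
        x = self.ln_f(x)
        return self.fc_out(x)       # logits, shape (batch, seq_len, vocab_size)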
Training Routine
The training routine follows the usual recipe for neural networks, adjusted for the extra depth introduced by the stacked blocks. I use the Adam optimizer with a learning rate chosen for the deeper architecture so that training converges reliably.
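The post doesn’t reproduce the loop itself; a sketch of what such a routine could look like, assuming next-token cross-entropy as the loss and placeholder hyperparameters (the learning rate, batch shapes, and the toy data generator below are my own choices), is:

# Hypothetical training-loop sketch: hyperparameter values and the toy data
# generator are placeholders, not taken from the original post.
import torch
import torch.nn as nn

def toy_batches(n_batches=10, batch_size=8, seq_len=64, vocab_size=50257):
    # random token ids stand in for a real dataloader over the corpus
    for _ in range(n_batches):
        seq = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
        yield seq[:, :-1], seq[:, 1:]          # inputs and next-token targets

model = MyGPT(vocab_size=50257, d_model=256, max_len=128, hidden_dim=1024,
              dropout_prob=0.1, n_layers=6, num_heads=8, use_multiple_head=True)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(500):                       # the post trains for 500 epochs
    for x, y in toy_batches():
        logits = model(x)                      # (batch, seq_len, vocab_size)
        loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()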
After extensive training over 500 epochs, the enhanced GPT model demonstrates improved capability in generating coherent and contextually relevant text. The phrases exhibit a more natural progression and connectivity, reflecting the model’s deeper understanding of language structure.
$ python run_model.py
Using tiktoken-based generation
Generated sequence: Nel mezzo del cammin di nostra vita bella; l'altr' ier, trenta gran rabbia fiorentina,
'sipa' Nel vano l'udire ch'entro l'affoca fa vergogna; assai chiara favella,
«Se t'ammentassi da lui acquistar, questa Tolomea, più che 'l mondo spersi? Ché e non vi nòi».
«Donna se non son digiuno». anella è colui che tu chi è quel punto saltò
The output shows improvements in sentence flow and phrase coherence, demonstrating the advantages of using stacked attention blocks to enhance the text generation process.
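For reference, the kind of sampling loop behind output like this, assuming tiktoken’s GPT-2 encoding and simple temperature-scaled multinomial sampling (both assumptions on my part, as are the prompt and parameter values), can be sketched as:

# Hypothetical generation sketch: the encoding name, temperature, prompt and
# max_len are illustrative choices, not taken from the original post.
import tiktoken
import torch
import torch.nn.functional as F

enc = tiktoken.get_encoding("gpt2")

@torch.no_grad()
def generate(model, prompt, max_new_tokens=100, max_len=128, temperature=1.0):
    model.eval()
    idx = torch.tensor([enc.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        # keep only the last max_len tokens as context
        logits = model(idx[:, -max_len:])
        # sample the next token from the distribution at the last position
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return enc.decode(idx[0].tolist())

print(generate(model, "Nel mezzo del cammin di nostra vita"))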
Stacking multiple GPT blocks has proven effective in enhancing the model’s performance, making it a suitable approach for anyone looking to explore advanced applications in natural language processing. The architectural choices and training strategies discussed here should provide a solid foundation for further experimentation and development.