Building A GPT From Scratch: Positional Encoding (PE)


In this post, I detail the recent enhancements I made to my GPT model. After upgrading the tokenizer by integrating tiktoken for better input handling, I implemented a positional encoding (PE) layer, a crucial step in improving the model’s ability to process sequential data. Sinusoidal PE gives the model information about the positions of tokens in a sequence, a vital addition for any transformer-based architecture. I also included a simple feedforward neural network layer to refine the output before adding the attention mechanism, setting the stage for further advancements.

Tokenizer

The quality of the model’s output depends heavily on how well the input is tokenized. I therefore switched to the tiktoken library, which provides the byte-pair encodings used by OpenAI models such as GPT-2 and GPT-3. I also added a preprocessing step that replaces numbers with a <NUM> token to encourage better generalization during training. Below is an excerpt of the code:


import tiktoken
import re

def advanced_tokenizer(text):
    # Replace all numbers with a <NUM> token for better generalization
    text = re.sub(r'\d+', '<NUM>', text)
    # Initialize tiktoken's GPT-2 byte-pair encoder
    enc = tiktoken.get_encoding("gpt2")
    # Special tokens to prepend to the sequence
    special_tokens = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"]
    # Tokenize the text and obtain token IDs
    token_ids = enc.encode(text)
    # Decode the IDs back to text and split on whitespace to get readable word-level tokens
    tokens = enc.decode(token_ids).split()
    # Prepend special tokens
    tokens = special_tokens + tokens
    # Create a vocabulary mapping tokens to unique integer indices
    vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

    return tokens, vocab
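
For illustration, here is a hypothetical call on a short sample sentence (the sentence itself is arbitrary):

tokens, vocab = advanced_tokenizer("The model was trained on 1000 books in 2024.")
print(tokens[:10])  # special tokens first, then the words, with numbers replaced by <NUM>
print(len(vocab))   # size of the vocabulary built from this sample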

Positional Encoding (PE)

In transformer models like GPT, the attention mechanism doesn’t inherently track the order of tokens in a sequence. Therefore, positional encoding is required to provide information about the token positions. I added a sinusoidal positional encoding to my model, a well-established method in transformers.
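
For reference, the code below implements the standard sinusoidal formulation from the original transformer paper, where pos is the token position, i indexes the embedding dimension, and d_model is the embedding size:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$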


import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x is expected to be of shape (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]
        return x
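
A minimal sanity check, assuming the input is a batch of already-embedded tokens (the dimensions here are arbitrary):

d_model = 64
pe_layer = PositionalEncoding(d_model)
x = torch.zeros(2, 10, d_model)              # batch of 2 sequences, 10 tokens each
out = pe_layer(x)
print(out.shape)                             # torch.Size([2, 10, 64])
print(torch.allclose(out[0, 0], out[0, 1]))  # False: each position receives a distinct encoding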

The positional encoding gives the model a sense of token order, making it much more effective at tasks that depend on sequential input.

Feedforward Neural Network

Finally, I integrated a simple feedforward neural network to act as a post-processing step in my model. This network refines the representations before moving on to the attention mechanism, improving the overall performance.


class FeedForwardNN(nn.Module):
    def __init__(self, d_model, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(d_model, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, d_model)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
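
To show how these pieces fit together, here is a hypothetical forward pass chaining a token embedding, the positional encoding, and the feedforward layer (the vocabulary size and dimensions below are placeholders):

vocab_size, d_model, hidden_dim = 8000, 64, 256
embedding = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoding(d_model)
ffn = FeedForwardNN(d_model, hidden_dim)

token_ids = torch.randint(0, vocab_size, (2, 10))  # batch of 2 sequences, 10 token IDs each
x = embedding(token_ids)                           # (2, 10, d_model)
x = pos_enc(x)                                     # add positional information
x = ffn(x)                                         # refine the representations
print(x.shape)                                     # torch.Size([2, 10, 64])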

With these changes, my GPT model is better equipped to handle sequential input, while the feedforward network enhances its ability to capture more complex patterns.

For more insights into this topic, you can find the details here.