Building a GPT from scratch: positional encoding (PE)
In this post, I detail the recent enhancements I made to my GPT model. After upgrading the tokenizer by integrating tiktoken for more robust handling of input text, I implemented a positional encoding (PE) layer, a crucial step because a transformer otherwise has no way to represent token order. Sinusoidal PE injects position information into every token embedding, letting the model reason about where tokens sit relative to one another. Additionally, I added a simple feedforward neural network layer to refine the representations before I move on to the attention mechanism, setting the stage for further advancements.
Tokenizer
The quality of model output heavily depends on how well the input is tokenized. Therefore, I switched to the tiktoken library, which implements the byte-pair encodings used by OpenAI's GPT models (the code below uses the GPT-2 encoding). I also added a preprocessing step that replaces numbers with a <NUM> token, which should encourage better generalization during training. Below is an excerpt of the code used:
import tiktoken
import re
def advanced_tokenizer(text):
    # Replace all numbers with a <NUM> token for better generalization
    text = re.sub(r'\d+', '<NUM>', text)
    # Load tiktoken's GPT-2 BPE encoding
    enc = tiktoken.get_encoding("gpt2")
    # Special tokens reserved for padding, sequence boundaries, and unknowns
    special_tokens = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"]
    # Tokenize the text and obtain token IDs
    token_ids = enc.encode(text)
    # Decode back to text and split on whitespace for readability
    tokens = enc.decode(token_ids).split()
    # Prepend the special tokens
    tokens = special_tokens + tokens
    # Create a vocabulary mapping tokens to unique integer indices
    vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
    return tokens, vocab
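As a quick sanity check, here is a minimal usage sketch (the sample sentence is just an illustration):

sample = "The model was trained on 42 examples in 2023."
tokens, vocab = advanced_tokenizer(sample)
print(tokens[:6])   # the four special tokens followed by the first words, numbers replaced by <NUM>
print(len(vocab))   # size of the resulting vocabulary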
Positional Encoding (PE)
In transformer models like GPT, the attention mechanism doesn’t inherently track the order of tokens in a sequence. Therefore, positional encoding is required to provide information about the token positions. I added a sinusoidal positional encoding to my model, a well-established method in transformers.
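Concretely, for a token at position pos and embedding dimension index i (with model width d_model), the standard sinusoidal encoding from the original Transformer paper is:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

The div_term in the code below computes the 1 / 10000^(2i / d_model) factors in log space for numerical stability.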
import torch
import torch.nn as nn
import math
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Precompute the sinusoidal table once for all positions up to max_len
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x is expected to be of shape (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]
        return x
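A minimal shape check (the batch size, sequence length, and d_model values here are arbitrary):

pe_layer = PositionalEncoding(d_model=64)
dummy = torch.zeros(2, 10, 64)   # (batch_size, seq_len, d_model)
out = pe_layer(dummy)
print(out.shape)      # torch.Size([2, 10, 64])
print(out[0, 0, :4])  # position 0 on a zero input: alternating sin(0)=0 and cos(0)=1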
The positional encoding gives the model explicit information about token order, which the attention mechanism alone cannot recover, and because it is added directly to the embeddings it introduces no extra learned parameters.
Feedforward Neural Network
Finally, I integrated a simple feedforward neural network as a post-processing step in my model. It applies a position-wise nonlinear transformation that refines the representations, and it sets the stage for the attention mechanism I plan to add next.
class FeedForwardNN(nn.Module):
    def __init__(self, d_model, hidden_dim):
        super().__init__()
        # Two-layer position-wise network: expand to hidden_dim, then project back to d_model
        self.fc1 = nn.Linear(d_model, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, d_model)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
By making these changes, my GPT model is now better equipped to handle the sequential data input effectively, while the feedforward network enhances its ability to capture more complex patterns.
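To see how the pieces fit together, here is a minimal sketch of a forward pass that chains the classes defined above; the vocabulary size, d_model, and hidden_dim values are placeholders, and the nn.Embedding layer simply stands in for whatever embedding the full model will eventually use:

import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only
vocab_size, d_model, hidden_dim = 1000, 64, 256

embedding = nn.Embedding(vocab_size, d_model)
pos_encoding = PositionalEncoding(d_model)
ffn = FeedForwardNN(d_model, hidden_dim)

# Fake batch of token IDs: (batch_size=2, seq_len=10)
token_ids = torch.randint(0, vocab_size, (2, 10))

x = embedding(token_ids)   # (2, 10, d_model)
x = pos_encoding(x)        # add sinusoidal position information
x = ffn(x)                 # position-wise refinement
print(x.shape)             # torch.Size([2, 10, 64])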
For more insights into this topic, you can find the details here.