Building A GPT From Scratch: Tokenizer And Data Loader


In this blog post, I present the initial steps in building a GPT model using PyTorch. This includes creating a tokenizer to process raw text and a data loader to efficiently manage and iterate through the dataset during training. By working through these fundamental components, I set up the foundation required for handling large textual data, which will later feed into the transformer-based architecture of a GPT model.

Tokenizer

Tokenization is a crucial first step in any natural language processing task. In this project, I work with the Divina Commedia by Dante Alighieri, using the text as the foundation for the model. The purpose of the tokenizer is to convert the raw text into a numerical format that the machine learning model can interpret and process.

In practice, this tokenizer does three things:

  1. Tokenization: It breaks the raw text into discrete units of meaning, typically words or subwords.
  2. Vocabulary Mapping: Each unique token is mapped to a unique integer.
  3. Tensor Conversion: The resulting sequence of token IDs is converted into a PyTorch tensor.

Here is the code I used to implement the tokenizer:


import torch

# Read the raw text of the Divina Commedia
# (the file name here is an assumption; point it to your local copy)
with open("divina_commedia.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Tokenize the text by splitting on whitespace
tokens = text.split()

# Create a vocabulary mapping each unique token to a unique integer
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Convert the list of tokens into a list of token IDs
token_ids = [vocab[token] for token in tokens]

# Convert the list of token IDs into a PyTorch tensor
token_tensor = torch.tensor(token_ids, dtype=torch.long)

This code reads the Divina Commedia, tokenizes it by splitting on whitespace, and creates a vocabulary mapping. The tokens are then converted into a list of token IDs, which I transform into a PyTorch tensor. That tensor holds the tokenized data, ready for further processing.

By running this code, I observe that the vocabulary contains 20,978 unique tokens, which gives me a sense of the text’s lexical variety. A brief look at the first 10 tokens and their corresponding IDs further confirms that the tokenization proceeds as expected.
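That check can be reproduced with a few print statements; this is a minimal sketch that assumes the tokens, token_ids, vocab, and token_tensor objects from the snippet above:

# Inspect the vocabulary size
print(f"Vocabulary size: {len(vocab)}")

# Look at the first 10 tokens and the IDs they map to
for token, token_id in zip(tokens[:10], token_ids[:10]):
    print(f"{token!r} -> {token_id}")

# The tensor should have one entry per token in the text
print(f"Token tensor shape: {token_tensor.shape}")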

Data loader

Once the text is tokenized, I need an efficient way to load and handle the data during model training. This is where the PyTorch DataLoader comes in. After splitting the dataset into training and evaluation sets, the DataLoader shuffles the training data and serves both sets in batches. These steps are essential for managing the large datasets commonly encountered in natural language processing.

I split the dataset with an 80-20 ratio: 80% of the data is used for training, and 20% is reserved for evaluation. Holding out this portion lets me check the model’s performance on unseen data and detect overfitting to the training set.

Here is the code I used to implement the data loader:


import torch
from torch.utils.data import Dataset, DataLoader

# Parameters
batch_size = 32
split_ratio = 0.8

num_train = int(len(token_tensor) * split_ratio)

# Split the dataset into training and evaluation sets
train_tensor = token_tensor[:num_train]
eval_tensor = token_tensor[num_train:]

# Define a simple dataset wrapper
class TokenDataset(Dataset):
    def __init__(self, data_tensor):
        self.data_tensor = data_tensor

    def __len__(self):
        return len(self.data_tensor)

    def __getitem__(self, index):
        return self.data_tensor[index]

# Create training and evaluation datasets
train_dataset = TokenDataset(train_tensor)
eval_dataset = TokenDataset(eval_tensor)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
eval_loader = DataLoader(eval_dataset, batch_size=batch_size, shuffle=False)

The code is designed to handle the tokenized data in batches. I set the batch size to 32, so each batch drawn from the loader contains 32 token IDs. The training data is shuffled so that the model does not learn any unintended ordering patterns from the original text, which would otherwise bias the training process.

After implementing the data loader, I verified that the training and evaluation sets are correctly split by printing out the shape of the first batch from both loaders. I confirmed that the training and evaluation data are being processed as expected, ready for model training.
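That verification can be done with a few lines like the following; a minimal sketch assuming the loaders and datasets defined above:

# Fetch the first batch from each loader and print its shape
train_batch = next(iter(train_loader))
eval_batch = next(iter(eval_loader))

print(f"First training batch shape: {train_batch.shape}")   # expected: torch.Size([32])
print(f"First evaluation batch shape: {eval_batch.shape}")  # expected: torch.Size([32])

# Sanity-check the 80-20 split
print(f"Training tokens: {len(train_dataset)}, evaluation tokens: {len(eval_dataset)}")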

Conclusion

By implementing both a tokenizer and a data loader, I have set up the components needed for processing text in a GPT model. Tokenization allows me to convert raw text into a numerical format, while the data loader efficiently manages this data during model training. These steps ensure that I can handle large datasets effectively, making them ready for the next stage in GPT development.

For more insights into this topic, you can find the details here.