This article is based on Andrej Karpathy's 4-hour video reproducing GPT-2. Having watched it, I found it excellent; it serves as the concluding chapter in the evolution of LLMs. This article is a written companion to the video. For the earlier content, please see the homepage at https://blog.nagi.fun, where the blogger has covered it very thoroughly.
This series is planned to be divided into three parts: the main implementation, accelerated implementation, and distributed training.
Implementing GPT-2 nn.Module#
Config Configuration#
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # Sequence length limit (context window length)
    vocab_size: int = 50257  # Vocabulary size
    n_layer: int = 12        # Number of Transformer layers
    n_head: int = 12         # Number of attention heads
    n_embd: int = 768        # Embedding dimension (vector length for each token)
The @dataclass decorator defines a configuration class named GPTConfig (if you are not familiar with decorators, you can look them up on CSDN or Zhihu).
Why use dataclass:
• A regular class requires manually writing the __init__ method; with the decorator that boilerplate disappears.
• Fields are declared explicitly, and you can directly print(GPTConfig(n_head=16)) to see all the parameters.
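As a quick illustration of the second point, here is a minimal check using the GPTConfig defined above (the printed repr is generated automatically by the dataclass):

config = GPTConfig(n_head=16)
print(config)
# GPTConfig(block_size=1024, vocab_size=50257, n_layer=12, n_head=16, n_embd=768)
print(config.n_embd)  # fields behave like normal attributes: 768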
BackBone#
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            # word token embedding
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # word position embedding
            wpe = nn.Embedding(config.block_size, config.n_embd),
            # main transformer blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            # final layer norm
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
• self.transformer: the core of the Transformer architecture.
• nn.ModuleDict: a dictionary of submodules inside an nn.Module; nn.ModuleDict(dict(ln_f=nn.LayerNorm(config.n_embd))) can be understood as {'ln_f': nn.LayerNorm(config.n_embd)}.
• wte: the word-token embedding table of shape [vocab_size, n_embd], mapping each token id to a feature vector.
• wpe: the word-position embedding table of shape [block_size, n_embd], mapping each position to a feature vector.
• h: the stack of Transformer blocks (GPT is decoder-only), where each Block consists of an attention layer and an MLP.
• ln_f: a final LayerNorm that tames the large variance accumulated by Pre-Norm; more on this below.
• lm_head: the final output layer, converting each token's feature vector into logits over the vocabulary.
• Block: the Transformer body is a stack of identical Blocks.
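A quick way to see what this structure produces (once the Block, CausalSelfAttention, and MLP classes below are defined) is to print the parameter names; they deliberately mirror the names in the Hugging Face GPT-2 checkpoint, which is what makes the from_pretrained loader below a simple key-by-key copy:

model = GPT(GPTConfig())
for name, param in model.state_dict().items():
    print(name, tuple(param.shape))
# transformer.wte.weight (50257, 768)
# transformer.wpe.weight (1024, 768)
# transformer.h.0.ln_1.weight (768,)
# ... and so on, down to lm_head.weight (50257, 768)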
Tips⚠️: Notice that in GPT-2 the LayerNorm is placed before the attention and the MLP, which differs from the original paper (shown in the figure above), where the residual is added first and LayerNorm is applied afterwards.
Karpathy's explanation: the original model adds the residual first and then applies LN, which means the residual itself also gets normalized, and that is not ideal. A clean residual path is preferable, because during backpropagation an addition distributes its gradient equally to both branches, so the gradient flows straight back to the input along the residual path, which is desirable from an optimization perspective. To be honest, I did not fully follow this explanation, so I searched for related material and found that GPT-2's arrangement is called Pre-Norm, while the arrangement in Attention Is All You Need is called Post-Norm.
Su's explanation of the difference between the two is very insightful. Write the residual connection as $x_{t+1} = x_t + F_t(x_t)$. If the variance of $x_t$ is $\sigma_1^2$ and the variance of $F_t(x_t)$ is $\sigma_2^2$, then the variance after the residual connection is $\sigma_1^2 + \sigma_2^2$: the residual amplifies the variance, and we need some way to bring it back down. A naive method is to add normalization, i.e. $x_{t+1} = \text{Norm}(x_t + F_t(x_t))$. However, while this stabilizes the variance of the forward pass, it severely weakens the identity branch of the residual and thus loses the residual's "easy to train" advantage; it typically needs warmup and a sufficiently small learning rate to converge. The Transformer structure is already sensitive to hyperparameters during warmup and slow to converge during optimization (the author is also unsure why this is the case), so under Post-Norm it becomes even harder to converge and the training cost rises accordingly.
Now, let's explain how the identity branch of the residual is weakened (this is what Karpathy means by a clean residual). Suppose that at initialization $x_t$ and $F_t(x_t)$ both have variance 1; then $x_t + F_t(x_t)$ has variance 2, and the normalization is responsible for scaling the variance back down to 1. This indicates that in the initial stage, Post-Norm is equivalent to

$$x_{t+1} = \frac{x_t + F_t(x_t)}{\sqrt{2}}$$

Recursively,

$$x_l = \frac{x_{l-1}}{\sqrt{2}} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \frac{x_{l-2}}{2} + \frac{F_{l-2}(x_{l-2})}{2} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \cdots = \frac{x_0}{2^{l/2}} + \frac{F_0(x_0)}{2^{l/2}} + \frac{F_1(x_1)}{2^{(l-1)/2}} + \cdots + \frac{F_{l-1}(x_{l-1})}{2^{1/2}}$$
The original purpose of the residual is to create a "green channel" for the earlier layers, so that gradients can reach them more directly. In Post-Norm this "green channel" is severely weakened: the closer a branch is to the input, the smaller its weight, so after many residual connections the earliest layers can barely feel the gradient coming from the end, the residual exists in name only, and the network becomes harder to train. For details, see the paper "On Layer Normalization in the Transformer Architecture".
The corrected Pre-Norm takes the form

$$x_{t+1} = x_t + F_t(\text{Norm}(x_t))$$

Expanding the iteration:

$$x_l = x_0 + F_0(\text{Norm}(x_0)) + F_1(\text{Norm}(x_1)) + \cdots + F_{l-1}(\text{Norm}(x_{l-1}))$$

Each residual branch carries equal weight, so the effect of the residual is much more pronounced than in Post-Norm and the model is easier to optimize. Of course, this also means the variance of the final output is large, so one more normalization is needed before the prediction layer, which is exactly what ln_f does.
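A toy sketch (my own illustration, not from the video) of why the Pre-Norm stream needs ln_f: each residual addition piles roughly one more unit of variance onto the stream.

import torch

torch.manual_seed(0)
x = torch.randn(10000, 768)        # pretend residual stream at layer 0, variance ≈ 1
for _ in range(2 * 12):            # 12 layers, each with an attention branch and an MLP branch
    x = x + torch.randn_like(x)    # Pre-Norm style: add an (independent) unit-variance branch
    print(round(x.var().item(), 1), end=" ")
# prints roughly 2.0 3.0 4.0 ... 25.0 — the variance grows linearly with depth, hence ln_f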
Karpathy notes that attention is where tokens communicate with one another; it is a pooling function, a weighted-sum function, a reduce operation. The MLP acts on each token individually, with no information collected or exchanged between tokens; it is a map operation.
MLP#
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, config.n_embd * 4)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(config.n_embd * 4, config.n_embd)
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
A very simple MLP: a linear map from n_embd to 4 * n_embd, a GELU non-linearity, then a linear map from 4 * n_embd back down to n_embd. The GELU curve looks very similar to ReLU, but it is smooth and keeps a small non-zero gradient for negative inputs, avoiding ReLU's problem of a derivative that is exactly 0 when x < 0, and this smoothness gives better results.
Karpathy discusses why the tanh approximation is used: this is a historical artifact; computing the exact GELU in TensorFlow was particularly slow at the time, so a tanh-based approximation of GELU was developed and adopted.
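A quick check (my own sketch) of how close the two variants are in PyTorch:

import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
print(nn.GELU()(x))                    # exact GELU (erf-based)
print(nn.GELU(approximate='tanh')(x))  # tanh approximation used here; the values are nearly identical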
Attention#
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, config.n_embd * 3)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        # The role of this flag is discussed later, in the weights section
        self.c_proj.NANOGPT_SCALE_INIT = 1
        # causal mask: lower-triangular matrix of ones, shaped (1, 1, T, T) for broadcasting
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                          .view(1, 1, config.block_size, config.block_size))
    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        # split into q, k, v, each of shape (B, T, n_embd)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # reshape query, key, value to (B, n_head, T, n_embd // n_head) so each head attends independently
        query = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        key = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        value = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # attention scores: QK^T / sqrt(d_k)
        att = query @ key.transpose(-1, -2) * (1.0 / math.sqrt(key.size(-1)))
        # mask out future positions, then softmax
        mask_att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        wei = F.softmax(mask_att, dim=-1)
        out = wei @ value
        # re-assemble the heads back into (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.c_proj(out)
        return out
• self.c_attn: the combined $W_Q, W_K, W_V$ projection, transforming the input $x$ into $Q$, $K$, $V$ in a single matrix multiply.
• self.c_proj: a linear layer applied after computing $\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.
• self.n_embd: the dimension of each token's feature vector.
• self.n_head: the number of heads in the multi-head attention.
• self.bias: despite the name, this is the causal mask: a lower-triangular matrix of ones, so everything above the diagonal is masked, preventing earlier tokens from attending to later tokens. Concretely, the scores above the diagonal are set to $-\infty$:

$$\begin{pmatrix} s_{11} & -\infty & -\infty \\ s_{21} & s_{22} & -\infty \\ s_{31} & s_{32} & s_{33} \end{pmatrix}$$

The $-\infty$ entries turn into weights close to 0 in the subsequent softmax, so they contribute nothing to the weighted sum.
• contiguous(): transpose does not change the physical memory order, only the logical view, and this function materializes the new order. For example, for the array $\begin{pmatrix}1&2\\3&4\end{pmatrix}$, transpose gives $\begin{pmatrix}1&3\\2&4\end{pmatrix}$, but both share the same physical storage $[1,2,3,4]$, so calling view on the transposed (non-contiguous) tensor raises an error.
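A minimal sketch of the contiguous() point:

import torch

a = torch.tensor([[1, 2], [3, 4]])
b = a.transpose(0, 1)            # logical transpose; the underlying storage is still [1, 2, 3, 4]
print(b.is_contiguous())         # False
# b.view(4)                      # would raise a RuntimeError because the memory is not contiguous
print(b.contiguous().view(4))    # tensor([1, 3, 2, 4])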
Download from Hugging Face#
# Inside class GPT
@classmethod
def from_pretrained(cls, model_type):
    """Loads pretrained GPT-2 model weights from huggingface"""
    from transformers import GPT2LMHeadModel
    # Four model sizes
    assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
    # Print which model is being loaded
    print("Loading weights from pretrained gpt: %s" % model_type)
    # Each GPT-2 variant has different hyperparameters
    config_args = {
        'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
        'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
        'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
        'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
    }[model_type]
    # Vocabulary size is always 50257
    config_args['vocab_size'] = 50257
    # The context window is always 1024
    config_args['block_size'] = 1024
    # Build our model from these hyperparameters
    config = GPTConfig(**config_args)
    model = GPT(config)
    # sd is our model's parameter dictionary
    sd = model.state_dict()
    sd_keys = sd.keys()
    sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard this mask
    # Download the weights from HF; sd_hf is the downloaded parameter dictionary
    model_hf = GPT2LMHeadModel.from_pretrained(model_type, cache_dir="/home/shong_Tan/project/gpt_2/model_weight", local_files_only=True)
    sd_hf = model_hf.state_dict()
    sd_keys_hf = sd_hf.keys()
    # Discard the masked_bias in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]
    # Discard the bias mask in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]
    # The HF checkpoint uses Conv1D modules, so these weights are stored transposed relative to nn.Linear
    transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
    # Ensure sd and sd_hf have the same number of parameter names
    assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
    # Copy the HF weights into our model, transposing where necessary
    for k in sd_keys_hf:
        if any(k.endswith(w) for w in transposed):
            assert sd_hf[k].shape[::-1] == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k].t())
        else:
            assert sd_hf[k].shape == sd[k].shape, f"mismatched keys: {sd_hf[k].shape} != {sd[k].shape}"
            with torch.no_grad():
                sd[k].copy_(sd_hf[k])
    return model
Just read the code comments.
Tips⚠️: The lm_head.weight and transformer.wte.weight downloaded from HF have the same shape, [50257, 768], and are in fact shared. One is the input embedding and the other produces the output logits; keeping them identical reflects the idea that the vector a token is embedded to on the way in should map back to that same token on the way out. Tying them (wte.weight and lm_head.weight pointing to the same tensor) also stores the [50257, 768] matrix only once, about 38.6M parameters, roughly 30% of the 124M model, which saves a lot of GPU memory.
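A standalone sketch of this weight tying (in the model itself it is a single assignment in GPT.__init__, e.g. self.transformer.wte.weight = self.lm_head.weight):

import torch.nn as nn

wte = nn.Embedding(50257, 768)               # input embedding, weight shape [50257, 768]
lm_head = nn.Linear(768, 50257, bias=False)  # output head, weight shape [50257, 768]
lm_head.weight = wte.weight                  # tie them: one shared parameter tensor
print(lm_head.weight is wte.weight)          # True
print(wte.weight.numel())                    # 38,597,376 parameters stored only once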
Forward#
# Inside class GPT
def forward(self, idx, target=None):
    # Input has shape [batch, token length]
    B, T = idx.size()
    # Token length cannot exceed the context window
    assert T <= self.config.block_size, f"Exceeds the context length limit by {T - self.config.block_size} tokens"
    # pos = [0, 1, 2, ..., T-1], remember to place it on the same device as idx
    pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
    # Position embedding
    pos_emb = self.transformer.wpe(pos)  # (T, n_embd)
    # Token embedding
    tok_emb = self.transformer.wte(idx)  # (B, T, n_embd)
    # Broadcast add: (B, T, n_embd) + (T, n_embd)
    x = tok_emb + pos_emb
    # Pass through the transformer blocks
    for block in self.transformer.h:
        x = block(x)
    # Final layer normalization
    x = self.transformer.ln_f(x)
    # Linear output layer
    logits = self.lm_head(x)  # (B, T, vocab_size)
    loss = None
    # If a target (label) is given, we are training and compute the loss; otherwise we can do inference
    if target is not None:
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))
    return logits, loss
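A small sketch of why the .view(-1, ...) reshapes are needed: F.cross_entropy expects (N, C) logits against (N,) class indices, so the (B, T) structure is flattened away.

import torch
import torch.nn.functional as F

B, T, V = 2, 4, 50257
logits = torch.randn(B, T, V)            # what the model produces
target = torch.randint(0, V, (B, T))     # next-token labels
loss = F.cross_entropy(logits.view(-1, V), target.view(-1))  # (B*T, V) against (B*T,)
print(loss.item())                       # around 11 for random logits, matching the untrained model below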
# A small test
num_return_sequences = 5
max_length = 30
model = GPT.from_pretrained('gpt2')
# eval() switches off dropout and changes batchnorm behaviour during evaluation (it does not freeze parameters)
model.eval()
# Move the model to the GPU
model.to('cuda')
# The following is tokenization; just use OpenAI's tiktoken library. If you want to know how it works, see the blog linked at the beginning of the article.
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I'm a language model, ")
tokens = torch.tensor(tokens, dtype=torch.long)  # [9]
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)  # [5, 9]
x = tokens.to('cuda')
while x.size(1) < max_length:
    with torch.no_grad():
        # Run the model to get predictions
        logits, loss = model(x)  # x: [B, T], logits: [B, T, vocab_size]
        # Keep only the prediction for the last token
        logits = logits[:, -1, :]  # [B, vocab_size]
        # Softmax over the vocabulary dimension
        probs = F.softmax(logits, dim=-1)  # [B, vocab_size]
        # Keep the 50 largest probabilities and their indices
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)  # both [B, 50]
        # Sample one entry from the top-k distribution
        ix = torch.multinomial(topk_probs, 1)  # [B, 1]
        # Look up the vocabulary index corresponding to the sampled entry
        xcol = torch.gather(topk_indices, -1, ix)  # [B, 1]
        # Append the new token to x, which becomes the next input [B, T+1]
        x = torch.cat((x, xcol), dim=1)
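To actually see the five continuations, decode each row back to text (a small addition; the variable names follow the snippet above):

for i in range(num_return_sequences):
    out_tokens = x[i, :max_length].tolist()
    print(">", enc.decode(out_tokens))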
Tokenization turns "Hello, I'm a language model, " into [15496, 11, 314, 1101, 257, 3303, 2746, 11, 220].
You can try it yourself at the following website:
https://tiktokenizer.vercel.app/
Initialization#
Dataset#
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
# This branch is for Apple's M-series chips
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = 'mps'
print("Using device: ", device)

import tiktoken
enc = tiktoken.get_encoding('gpt2')
with open('input.txt', 'r') as f:
    data = f.read()
text = data[:1000]
tokens = enc.encode(text)
B, T = 4, 32
buf = torch.tensor(tokens[:B*T + 1])
buf = buf.to(device)
# Essentially predicting token n+1 from the first n tokens
x = buf[:-1].view(B, T)
y = buf[1:].view(B, T)

model = GPT(GPTConfig())
model.to(device)
logits, loss = model(x, y)
print(loss.item())
Here, loss is approximately 11, because an untrained model predicts roughly uniformly over the vocabulary, and $-\ln(1/50257) \approx 10.8$.
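A one-line sanity check of that number:

import math
print(-math.log(1 / 50257))   # ≈ 10.82, the cross-entropy of a uniform guess over 50257 tokens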
Training a single batch code:
# Use the AdamW optimizer; look up the difference between Adam and SGD on your own
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    # Clear the gradients held by the optimizer
    optimizer.zero_grad()
    # Get logits and loss
    logits, loss = model(x, y)
    # Backpropagate to compute gradients
    loss.backward()
    # Update the parameters using the gradients
    optimizer.step()
The Adam optimizer can converge faster than SGD.
Dataloader function:
class DataLoaderLite():
    def __init__(self, B, T):
        self.B = B
        self.T = T
        # Read the whole of input.txt
        with open('input.txt', 'r') as f:
            data = f.read()
        enc = tiktoken.get_encoding('gpt2')
        tokens = enc.encode(data)
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        print(f"loaded {len(self.tokens)} tokens")
        print(f"1 epoch = {len(self.tokens)//(B*T)} batches")
        # Current position within the token stream
        self.current_position = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position: self.current_position+B*T+1]
        x = buf[:-1].view(B, T)
        y = buf[1:].view(B, T)
        # Each batch consumes B*T tokens (plus one extra token for the shifted targets)
        self.current_position += B*T
        # If the next batch would run past the end, wrap around to tokens[0]
        if self.current_position + B*T + 1 > len(self.tokens):
            self.current_position = 0
        return x, y
Corrected training code:
train_loader = DataLoaderLite(4, 32)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    optimizer.zero_grad()
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    print(f"step: {i}, loss: {loss.item()}")
Weights#
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # Residual projections get an extra 1/sqrt(2 * n_layer) scaling (see the flag set in CausalSelfAttention)
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std = std * (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0, std=0.02)
std = std * (2 * self.config.n_layer) ** -0.5: the variance here accounts for the contribution of the residual stream. Each residual connection adds another roughly equal contribution to the activations, so the output projections are scaled by a factor of $1/\sqrt{N}$, where $N = 2 \times n\_layer$ is the number of residual additions; this controls the variance growth of the Pre-Norm residual stream. The factor of 2 is there because every layer uses a residual twice, once for the attention and once for the MLP.
std: the value 0.02 is not arbitrary either. Following the GPT-2 code, it should be on the order of $1/\sqrt{n\_embd}$ (for n_embd = 768, $1/\sqrt{768} \approx 0.036$), so 0.02 is a reasonable choice.
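In Karpathy's code the initializer is wired up by calling self.apply(self._init_weights) at the end of GPT.__init__, so it visits every submodule. A quick sanity check of the scaled std for the 124M configuration (n_layer = 12):

import math

n_layer = 12
print(0.02 * (2 * n_layer) ** -0.5)  # ≈ 0.00408, the std used for the c_proj weights
print(1 / math.sqrt(768))            # ≈ 0.036, the 1/sqrt(n_embd) scale that motivates std ≈ 0.02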