This article is based on Andrej Karpathy's 4-hour video reproducing GPT-2. Having watched it, I found it excellent; it serves as the concluding chapter in the evolution of LLMs. This article is a written companion to the video. For the earlier content, please see the homepage at https://blog.nagi.fun, where the blogger has covered it very thoroughly.
This series is planned to be divided into three parts: the main implementation, accelerated implementation, and distributed training.
Implementing GPT-2 nn.Module#
Config Configuration#
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # Sequence length limit (context window length)
    vocab_size: int = 50257  # Vocabulary size
    n_layer: int = 12        # Number of Transformer layers
    n_head: int = 12         # Number of attention heads
    n_embd: int = 768        # Embedding dimension (vector length for each token)
The @dataclass decorator defines a configuration class named GPTConfig (if you are not familiar with decorators, you can look them up on CSDN or Zhihu).
Why use dataclass:
• A regular class requires manually writing the __init__ method; with the decorator that boilerplate disappears.
• Fields are declared explicitly, and you can directly print(GPTConfig(n_head=16)) to see all the parameters.
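As a quick illustration of the second point, here is a minimal check using the GPTConfig defined above (the printed repr is generated automatically by the dataclass):

config = GPTConfig(n_head=16)
print(config)
# GPTConfig(block_size=1024, vocab_size=50257, n_layer=12, n_head=16, n_embd=768)
print(config.n_embd)  # fields behave like normal attributes: 768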
BackBone#
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            # word token embedding
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # word position embedding
            wpe = nn.Embedding(config.block_size, config.n_embd),
            # main transformer blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            # final layer norm
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
• self.transformer: the core of the Transformer architecture.
• nn.ModuleDict: a dictionary of submodules inside an nn.Module; nn.ModuleDict(dict(ln_f=nn.LayerNorm(config.n_embd))) can be understood as {'ln_f': nn.LayerNorm(config.n_embd)}.
• wte: the word-token embedding table of shape [vocab_size, n_embd], mapping each token id to a feature vector.
• wpe: the word-position embedding table of shape [block_size, n_embd], mapping each position to a feature vector.
• h: the stack of Transformer blocks (GPT is decoder-only), where each Block consists of an attention layer and an MLP.
• ln_f: a final LayerNorm that tames the large variance accumulated by Pre-Norm; more on this below.
• lm_head: the final output layer, converting each token's feature vector into logits over the vocabulary.
• Block: the Transformer body is a stack of identical Blocks.
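A quick way to see what this structure produces (once the Block, CausalSelfAttention, and MLP classes below are defined) is to print the parameter names; they deliberately mirror the names in the Hugging Face GPT-2 checkpoint, which is what makes the from_pretrained loader below a simple key-by-key copy:

model = GPT(GPTConfig())
for name, param in model.state_dict().items():
    print(name, tuple(param.shape))
# transformer.wte.weight (50257, 768)
# transformer.wpe.weight (1024, 768)
# transformer.h.0.ln_1.weight (768,)
# ... and so on, down to lm_head.weight (50257, 768)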
Tips⚠️: Notice that in GPT-2 the LayerNorm is placed before the attention and the MLP, which differs from the original paper (shown in the figure above), where the residual is added first and LayerNorm is applied afterwards.
Karpathy's explanation: the original model adds the residual first and then applies LN, which means the residual itself also gets normalized, and that is not ideal. A clean residual path is preferable, because during backpropagation an addition distributes its gradient equally to both branches, so the gradient flows straight back to the input along the residual path, which is desirable from an optimization perspective. To be honest, I did not fully follow this explanation, so I searched for related material and found that GPT-2's arrangement is called Pre-Norm, while the arrangement in Attention Is All You Need is called Post-Norm.
Su's explanation of the difference between the two is very insightful. Write the residual connection as $x_{t+1} = x_t + F_t(x_t)$. If the variance of $x_t$ is $\sigma_1^2$ and the variance of $F_t(x_t)$ is $\sigma_2^2$, then the variance after the residual connection is $\sigma_1^2 + \sigma_2^2$: the residual amplifies the variance, and we need some way to bring it back down. A naive method is to add normalization, i.e. $x_{t+1} = \text{Norm}(x_t + F_t(x_t))$. However, while this stabilizes the variance of the forward pass, it severely weakens the identity branch of the residual and thus loses the residual's "easy to train" advantage; it typically needs warmup and a sufficiently small learning rate to converge. The Transformer structure is already sensitive to hyperparameters during warmup and slow to converge during optimization (the author is also unsure why this is the case), so under Post-Norm it becomes even harder to converge and the training cost rises accordingly.
Now, let's explain how the identity branch of the residual is weakened (this is what Karpathy means by a clean residual). Suppose that at initialization $x_t$ and $F_t(x_t)$ both have variance 1; then $x_t + F_t(x_t)$ has variance 2, and the normalization is responsible for scaling the variance back down to 1. This indicates that in the initial stage, Post-Norm is equivalent to

$$x_{t+1} = \frac{x_t + F_t(x_t)}{\sqrt{2}}$$

Recursively,

$$x_l = \frac{x_{l-1}}{\sqrt{2}} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \frac{x_{l-2}}{2} + \frac{F_{l-2}(x_{l-2})}{2} + \frac{F_{l-1}(x_{l-1})}{\sqrt{2}} = \cdots = \frac{x_0}{2^{l/2}} + \frac{F_0(x_0)}{2^{l/2}} + \frac{F_1(x_1)}{2^{(l-1)/2}} + \cdots + \frac{F_{l-1}(x_{l-1})}{2^{1/2}}$$
The original purpose of the residual is to create a "green channel" for the earlier layers, so that gradients can reach them more directly. In Post-Norm this "green channel" is severely weakened: the closer a branch is to the input, the smaller its weight, so after many residual connections the earliest layers can barely feel the gradient coming from the end, the residual exists in name only, and the network becomes harder to train. For details, see the paper "On Layer Normalization in the Transformer Architecture".
The corrected Pre-Norm takes the form

$$x_{t+1} = x_t + F_t(\text{Norm}(x_t))$$

Expanding the iteration:

$$x_l = x_0 + F_0(\text{Norm}(x_0)) + F_1(\text{Norm}(x_1)) + \cdots + F_{l-1}(\text{Norm}(x_{l-1}))$$

Each residual branch carries equal weight, so the effect of the residual is much more pronounced than in Post-Norm and the model is easier to optimize. Of course, this also means the variance of the final output is large, so one more normalization is needed before the prediction layer, which is exactly what ln_f does.
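A toy sketch (my own illustration, not from the video) of why the Pre-Norm stream needs ln_f: each residual addition piles roughly one more unit of variance onto the stream.

import torch

torch.manual_seed(0)
x = torch.randn(10000, 768)        # pretend residual stream at layer 0, variance ≈ 1
for _ in range(2 * 12):            # 12 layers, each with an attention branch and an MLP branch
    x = x + torch.randn_like(x)    # Pre-Norm style: add an (independent) unit-variance branch
    print(round(x.var().item(), 1), end=" ")
# prints roughly 2.0 3.0 4.0 ... 25.0 — the variance grows linearly with depth, hence ln_f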
Karpathy notes that attention is where tokens communicate with one another; it is a pooling function, a weighted-sum function, a reduce operation. The MLP acts on each token individually, with no information collected or exchanged between tokens; it is a map operation.
MLP#
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, config.n_embd * 4)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(config.n_embd * 4, config.n_embd)
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
A very simple MLP: a linear map from n_embd to 4 * n_embd, a GELU non-linearity, then a linear map from 4 * n_embd back down to n_embd. The GELU curve looks very similar to ReLU, but it is smooth and keeps a small non-zero gradient for negative inputs, avoiding ReLU's problem of a derivative that is exactly 0 when x < 0, and this smoothness gives better results.
Karpathy discusses why the tanh approximation is used: this is a historical artifact; computing the exact GELU in TensorFlow was particularly slow at the time, so a tanh-based approximation of GELU was developed and adopted.
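A quick check (my own sketch) of how close the two variants are in PyTorch:

import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
print(nn.GELU()(x))                    # exact GELU (erf-based)
print(nn.GELU(approximate='tanh')(x))  # tanh approximation used here; the values are nearly identical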
Attention#
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, config.n_embd * 3)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        # The role of this flag is discussed later, in the weights section
        self.c_proj.NANOGPT_SCALE_INIT = 1
        # causal mask: lower-triangular matrix of ones, shaped (1, 1, T, T) for broadcasting
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                          .view(1, 1, config.block_size, config.block_size))
    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        # split into q, k, v, each of shape (B, T, n_embd)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # reshape query, key, value to (B, n_head, T, n_embd // n_head) so each head attends independently
        query = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        key = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        value = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # attention scores: QK^T / sqrt(d_k)
        att = query @ key.transpose(-1, -2) * (1.0 / math.sqrt(key.size(-1)))
        # mask out future positions, then softmax
        mask_att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        wei = F.softmax(mask_att, dim=-1)
        out = wei @ value
        # re-assemble the heads back into (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.c_proj(out)
        return out
• self.c_attn: the combined $W_Q, W_K, W_V$ projection, transforming the input $x$ into $Q$, $K$, $V$ in a single matrix multiply.
• self.c_proj: a linear layer applied after computing $\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.
• self.n_embd: the dimension of each token's feature vector.
• self.n_head: the number of heads in the multi-head attention.
• self.bias: despite the name, this is the causal mask: a lower-triangular matrix of ones, so everything above the diagonal is masked, preventing earlier tokens from attending to later tokens. Concretely, the scores above the diagonal are set to $-\infty$:

$$\begin{pmatrix} s_{11} & -\infty & -\infty \\ s_{21} & s_{22} & -\infty \\ s_{31} & s_{32} & s_{33} \end{pmatrix}$$

The $-\infty$ entries turn into weights close to 0 in the subsequent softmax, so they contribute nothing to the weighted sum.
• contiguous(): transpose does not change the physical memory order, only the logical view, and this function materializes the new order. For example, for the array $\begin{pmatrix}1&2\\3&4\end{pmatrix}$, transpose gives $\begin{pmatrix}1&3\\2&4\end{pmatrix}$, but both share the same physical storage $[1,2,3,4]$, so calling view on the transposed (non-contiguous) tensor raises an error.
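A minimal sketch of the contiguous() point:

import torch

a = torch.tensor([[1, 2], [3, 4]])
b = a.transpose(0, 1)            # logical transpose; the underlying storage is still [1, 2, 3, 4]
print(b.is_contiguous())         # False
# b.view(4)                      # would raise a RuntimeError because the memory is not contiguous
print(b.contiguous().view(4))    # tensor([1, 3, 2, 4])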
Download from Hugging Face#
# Inside class GPT
@classmethod
def from_pretrained(cls, model_type):
    """Loads pretrained GPT-2 model weights from huggingface"""
    from transformers import GPT2LMHeadModel
    # Four model sizes
    assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
    # Print which model is being loaded
    print("Loading weights from pretrained gpt: %s" % model_type)
    # Each GPT-2 variant has different hyperparameters
    config_args = {
        'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
        'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
        'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
        'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
    }[model_type]
    # Vocabulary size is always 50257
    config_args['vocab_size'] = 50257
    # The context window is always 1024
    config_args['block_size'] = 1024
    # Build our model from these hyperparameters
    config = GPTConfig(**config_args)
    model = GPT(config)
    # sd is our model's parameter dictionary
    sd = model.state_dict()
    sd_keys = sd.keys()
    sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard this mask
    # Download the weights from HF; sd_hf is the downloaded parameter dictionary
    model_hf = GPT2LMHeadModel.from_pretrained(model_type, cache_dir="/home/shong_Tan/project/gpt_2/model_weight", local_files_only=True)
    sd_hf = model_hf.state_dict()
    sd_keys_hf = sd_hf.keys()
    # Discard the masked_bias in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]
    # Discard the bias mask in the HF weights
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]
    # The HF checkpoint uses Conv1D modules, so these weights are stored transposed relative to nn.Linear
    transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
    # Ensure sd and sd_hf have the same number of parameter names
    assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
    # Copy the HF weights into our model, transposing where necessary
    for k in sd_keys_hf:
        if any(k.endswith(w) for w in transposed):
            assert sd_hf[k].shape[::-1] == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k].t())
        else:
            assert sd_hf[k].shape == sd[k].shape, f"mismatched keys: {sd_hf[k].shape} != {sd[k].shape}"
            with torch.no_grad():
                sd[k].copy_(sd_hf[k])
    return model
Just read the code comments.
Tips⚠️: The lm_head.weight and transformer.wte.weight downloaded from HF have the same shape, [50257, 768], and are in fact shared. One is the input embedding and the other produces the output logits; keeping them identical reflects the idea that the vector a token is embedded to on the way in should map back to that same token on the way out. Tying them (wte.weight and lm_head.weight pointing to the same tensor) also stores the [50257, 768] matrix only once, about 38.6M parameters, roughly 30% of the 124M model, which saves a lot of GPU memory.
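A standalone sketch of this weight tying (in the model itself it is a single assignment in GPT.__init__, e.g. self.transformer.wte.weight = self.lm_head.weight):

import torch.nn as nn

wte = nn.Embedding(50257, 768)               # input embedding, weight shape [50257, 768]
lm_head = nn.Linear(768, 50257, bias=False)  # output head, weight shape [50257, 768]
lm_head.weight = wte.weight                  # tie them: one shared parameter tensor
print(lm_head.weight is wte.weight)          # True
print(wte.weight.numel())                    # 38,597,376 parameters stored only once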
Forward#
# Inside class GPT
def forward(self, idx, target=None):
    # Input has shape [batch, token length]
    B, T = idx.size()
    # Token length cannot exceed the context window
    assert T <= self.config.block_size, f"Exceeds the context length limit by {T - self.config.block_size} tokens"
    # pos = [0, 1, 2, ..., T-1], remember to place it on the same device as idx
    pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
    # Position embedding
    pos_emb = self.transformer.wpe(pos)  # (T, n_embd)
    # Token embedding
    tok_emb = self.transformer.wte(idx)  # (B, T, n_embd)
    # Broadcast add: (B, T, n_embd) + (T, n_embd)
    x = tok_emb + pos_emb
    # Pass through the transformer blocks
    for block in self.transformer.h:
        x = block(x)
    # Final layer normalization
    x = self.transformer.ln_f(x)
    # Linear output layer
    logits = self.lm_head(x)  # (B, T, vocab_size)
    loss = None
    # If a target (label) is given, we are training and compute the loss; otherwise we can do inference
    if target is not None:
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))
    return logits, loss
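A small sketch of why the .view(-1, ...) reshapes are needed: F.cross_entropy expects (N, C) logits against (N,) class indices, so the (B, T) structure is flattened away.

import torch
import torch.nn.functional as F

B, T, V = 2, 4, 50257
logits = torch.randn(B, T, V)            # what the model produces
target = torch.randint(0, V, (B, T))     # next-token labels
loss = F.cross_entropy(logits.view(-1, V), target.view(-1))  # (B*T, V) against (B*T,)
print(loss.item())                       # around 11 for random logits, matching the untrained model below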
# A small test
num_return_sequences = 5
max_length = 30
model = GPT.from_pretrained('gpt2')
# eval() switches off dropout and changes batchnorm behaviour during evaluation (it does not freeze parameters)
model.eval()
# Move the model to the GPU
model.to('cuda')
# The following is tokenization; just use OpenAI's tiktoken library. If you want to know how it works, see the blog linked at the beginning of the article.
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I'm a language model, ")
tokens = torch.tensor(tokens, dtype=torch.long)  # [9]
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)  # [5, 9]
x = tokens.to('cuda')
while x.size(1) < max_length:
    with torch.no_grad():
        # Run the model to get predictions
        logits, loss = model(x)  # x: [B, T], logits: [B, T, vocab_size]
        # Keep only the prediction for the last token
        logits = logits[:, -1, :]  # [B, vocab_size]
        # Softmax over the vocabulary dimension
        probs = F.softmax(logits, dim=-1)  # [B, vocab_size]
        # Keep the 50 largest probabilities and their indices
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)  # both [B, 50]
        # Sample one entry from the top-k distribution
        ix = torch.multinomial(topk_probs, 1)  # [B, 1]
        # Look up the vocabulary index corresponding to the sampled entry
        xcol = torch.gather(topk_indices, -1, ix)  # [B, 1]
        # Append the new token to x, which becomes the next input [B, T+1]
        x = torch.cat((x, xcol), dim=1)
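To actually see the five continuations, decode each row back to text (a small addition; the variable names follow the snippet above):

for i in range(num_return_sequences):
    out_tokens = x[i, :max_length].tolist()
    print(">", enc.decode(out_tokens))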
Tokenization turns "Hello, I'm a language model, " into [15496, 11, 314, 1101, 257, 3303, 2746, 11, 220].
You can try it yourself at the following website:
https://tiktokenizer.vercel.app/
Initialization#
Dataset#
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
# This branch is for Apple's M-series chips
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = 'mps'
print("Using device: ", device)

import tiktoken
enc = tiktoken.get_encoding('gpt2')
with open('input.txt', 'r') as f:
    data = f.read()
text = data[:1000]
tokens = enc.encode(text)
B, T = 4, 32
buf = torch.tensor(tokens[:B*T + 1])
buf = buf.to(device)
# Essentially predicting token n+1 from the first n tokens
x = buf[:-1].view(B, T)
y = buf[1:].view(B, T)

model = GPT(GPTConfig())
model.to(device)
logits, loss = model(x, y)
print(loss.item())
Here, loss is approximately 11, because an untrained model predicts roughly uniformly over the vocabulary, and $-\ln(1/50257) \approx 10.8$.
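A one-line sanity check of that number:

import math
print(-math.log(1 / 50257))   # ≈ 10.82, the cross-entropy of a uniform guess over 50257 tokens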
Training a single batch code:
# Use the AdamW optimizer; look up the difference between Adam and SGD on your own
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    # Clear the gradients held by the optimizer
    optimizer.zero_grad()
    # Get logits and loss
    logits, loss = model(x, y)
    # Backpropagate to compute gradients
    loss.backward()
    # Update the parameters using the gradients
    optimizer.step()
The Adam optimizer can converge faster than SGD.
Dataloader function:
class DataLoaderLite():
    def __init__(self, B, T):
        self.B = B
        self.T = T
        # Read the whole of input.txt
        with open('input.txt', 'r') as f:
            data = f.read()
        enc = tiktoken.get_encoding('gpt2')
        tokens = enc.encode(data)
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        print(f"loaded {len(self.tokens)} tokens")
        print(f"1 epoch = {len(self.tokens)//(B*T)} batches")
        # Current position within the token stream
        self.current_position = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position: self.current_position+B*T+1]
        x = buf[:-1].view(B, T)
        y = buf[1:].view(B, T)
        # Each batch consumes B*T tokens (plus one extra token for the shifted targets)
        self.current_position += B*T
        # If the next batch would run past the end, wrap around to tokens[0]
        if self.current_position + B*T + 1 > len(self.tokens):
            self.current_position = 0
        return x, y
Corrected training code:
train_loader = DataLoaderLite(4, 32)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    optimizer.zero_grad()
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    print(f"step: {i}, loss: {loss.item()}")
Weights#
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # Residual projections get an extra 1/sqrt(2 * n_layer) scaling (see the flag set in CausalSelfAttention)
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std = std * (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0, std=0.02)
std = std * (2 * self.config.n_layer) ** -0.5: the variance here accounts for the contribution of the residual stream. Each residual connection adds another roughly equal contribution to the activations, so the output projections are scaled by a factor of $1/\sqrt{N}$, where $N = 2 \times n\_layer$ is the number of residual additions; this controls the variance growth of the Pre-Norm residual stream. The factor of 2 is there because every layer uses a residual twice, once for the attention and once for the MLP.
std: the value 0.02 is not arbitrary either. Following the GPT-2 code, it should be on the order of $1/\sqrt{n\_embd}$ (for n_embd = 768, $1/\sqrt{768} \approx 0.036$), so 0.02 is a reasonable choice.
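In Karpathy's code the initializer is wired up by calling self.apply(self._init_weights) at the end of GPT.__init__, so it visits every submodule. A quick sanity check of the scaled std for the 124M configuration (n_layer = 12):

import math

n_layer = 12
print(0.02 * (2 * n_layer) ** -0.5)  # ≈ 0.00408, the std used for the c_proj weights
print(1 / math.sqrt(768))            # ≈ 0.036, the 1/sqrt(n_embd) scale that motivates std ≈ 0.02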