The X's are vectors. Act is the activation function. The subscript is the layer.
Watch this first: Neural networks
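A minimal NumPy sketch of the forward pass described above, assuming ReLU as Act and made-up layer sizes; the specific widths and names are illustrative, not taken from the video.

```python
# A minimal sketch of a feedforward pass, assuming ReLU as "Act"
# and hypothetical layer sizes; x starts as the input vector x_1.
import numpy as np

def act(z):
    return np.maximum(0.0, z)          # ReLU activation

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                   # hypothetical layer widths
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

x = rng.normal(size=sizes[0])          # x_1: input vector
for W, b in zip(Ws, bs):
    x = act(W @ x + b)                 # x_{l+1} = Act(W_l x_l + b_l)
print(x.shape)                         # (3,)
```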
X and K are matrices. Convolutions from the same kernel are summed. Then either max pooling or mean pooling is applied. The pooled matrices become the input of the next convolution layer. Each convolution layer has a different set of kernels. The image only shows two kernels, but usually there are around 16 or 32 different kernels per layer. Subscript is the layer, superscript is the member within the layer.
Watch these first: Convolution, Code implementation
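A minimal NumPy sketch of one convolution layer followed by max pooling, assuming "valid" convolution (really cross-correlation, as most libraries implement it), per-channel results summed for each kernel, and made-up kernel counts and sizes.

```python
# A minimal sketch of one convolution layer: convolutions from the same
# kernel are summed over input channels, then 2x2 max pooling is applied.
import numpy as np

def conv2d(x, kernels):
    # x: (C_in, H, W), kernels: (C_out, C_in, kH, kW)
    C_out, C_in, kH, kW = kernels.shape
    H, W = x.shape[1] - kH + 1, x.shape[2] - kW + 1
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        for i in range(H):
            for j in range(W):
                # sum the per-channel products for kernel o
                out[o, i, j] = np.sum(x[:, i:i+kH, j:j+kW] * kernels[o])
    return out

def max_pool2d(x, k=2):
    C, H, W = x.shape
    return x[:, :H - H % k, :W - W % k].reshape(C, H // k, k, W // k, k).max(axis=(2, 4))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(1, 8, 8))                  # grayscale "image"
K1 = rng.normal(size=(16, 1, 3, 3))              # 16 kernels in layer 1
x2 = max_pool2d(np.maximum(0, conv2d(x1, K1)))   # input to the next layer
print(x2.shape)                                  # (16, 3, 3)
```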
Perform convolution multiple times (see Convolution Layer). Then flatten the resulting matrices into a single vector. Then pass it through a regular feedforward network.
Watch this first (go to the appropriate section of the playlist): Neural networks
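A minimal sketch of the flatten-and-classify step, assuming the convolution/pooling stages above have already produced a (16, 3, 3) feature tensor; the layer sizes and the number of classes (10) are made up.

```python
# A minimal sketch of the classifier head: flatten the pooled matrices
# into one vector, then pass it through a regular feedforward network.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 3, 3))     # stand-in for the last pooled output
v = features.reshape(-1)                   # flatten into one vector: (144,)

W1, b1 = rng.normal(size=(64, v.size)), np.zeros(64)
W2, b2 = rng.normal(size=(10, 64)), np.zeros(10)

h = np.maximum(0, W1 @ v + b1)             # hidden feedforward layer (ReLU)
logits = W2 @ h + b2
probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax over classes
print(probs.shape)                         # (10,)
```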
The tokenizer is described in a later section. For now, see it as a black box that splits text into individual tokens. Together, the tokens form a "vocabulary". The number of tokens k is the vocabulary size. The number of tokens that the LLM can process at any one time is called the context size C. Each known token has a corresponding vector; these vectors are the rows of the embedding matrix WE. The array of C tokens is converted into an array of the C corresponding vectors, which are stacked vertically to produce the embedding. The latter is summed with the positional encoding, divided by a scaling factor, and passed through dropout, to finally produce the input matrix X1, which is fed to the first transformer.
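A minimal sketch of how X1 could be built, assuming a toy vocabulary, sinusoidal positional encoding, sqrt of the embedding dimension as the scaling factor, and a dropout rate of 0.1; these specific choices are assumptions for illustration, not taken from the source.

```python
# A minimal sketch of building the input matrix X1; the tokenizer itself
# is treated as a black box (a stand-in array of token ids is used).
import numpy as np

rng = np.random.default_rng(0)
k, C, d = 1000, 8, 16                      # vocab size, context size, embedding dim
W_E = rng.normal(size=(k, d))              # embedding matrix: one row per known token

token_ids = rng.integers(0, k, size=C)     # stand-in for tokenizer output
E = W_E[token_ids]                         # stack the C token vectors: (C, d)

pos = np.arange(C)[:, None]                # sinusoidal positional encoding (assumed)
i = np.arange(d // 2)[None, :]
P = np.zeros((C, d))
P[:, 0::2] = np.sin(pos / 10000 ** (2 * i / d))
P[:, 1::2] = np.cos(pos / 10000 ** (2 * i / d))

X1 = (E + P) / np.sqrt(d)                  # sum with positional encoding, then scale (sqrt(d) assumed)
keep = rng.random(X1.shape) > 0.1          # dropout, rate 0.1 assumed (training only)
X1 = X1 * keep / 0.9
print(X1.shape)                            # (8, 16)
```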
An attention head consists of a triple of matrices: Wq (query), Wk (key), Wv (value). The input matrix is fed into the attention head, which produces a resulting matrix that is also called an attention head. This matrix has the same number of rows as the input matrix, but fewer columns. This is because we repeat this operation on the same input matrix with several different attention heads (thus different triples Wq, Wk, Wv), and we concatenate/stack the attention heads horizontally.
Note that the concatenated attention heads may yield a matrix with a different number of columns than the original input matrix (the number of rows is still the same). So the output matrix Wo is given exactly the dimensions needed so that the resulting product Y has the same size as X.
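A minimal sketch of multi-head attention as described above, assuming 4 heads, no causal mask, and head width d/4 (so the concatenated heads happen to match the width of X); all sizes are made up.

```python
# A minimal sketch of multi-head attention: each head has its own
# Wq, Wk, Wv; the heads are stacked horizontally and mapped back by Wo.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C, d, n_heads = 8, 16, 4
d_head = d // n_heads

X = rng.normal(size=(C, d))
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # (C, d_head) each
    A = softmax(Q @ K.T / np.sqrt(d_head))            # (C, C) attention weights
    heads.append(A @ V)                               # one "attention head": (C, d_head)

H = np.concatenate(heads, axis=1)                     # stack heads horizontally
W_o = rng.normal(size=(H.shape[1], d))                # sized so that Y matches X
Y = H @ W_o
print(Y.shape)                                        # (8, 16), same as X
```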
After passing X through the attention layer, each row of the resulting Y goes through the same position-wise FFN, applied to each row independently. The activation function is usually ReLU. It has a single hidden layer, typically a few times wider than the input (four times in the original Transformer). The output must have the same size as the input. The outputs are concatenated/stacked vertically and added to the original X, to produce the input matrix for the next transformer block. In an LLM, there are many transformer blocks one after the other.
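A minimal sketch of the position-wise FFN step, assuming a hidden layer four times wider than the input, ReLU, and a residual connection back to X; the sizes are made up and Y is a stand-in for the attention output.

```python
# A minimal sketch of the position-wise FFN: the same weights are applied
# to every row of Y, and the result is added back to the original X.
import numpy as np

rng = np.random.default_rng(0)
C, d, d_ff = 8, 16, 64

X = rng.normal(size=(C, d))                # input to the transformer block
Y = rng.normal(size=(C, d))                # stand-in for the attention output

W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

H = np.maximum(0, Y @ W1 + b1)             # each row through the hidden layer (ReLU)
F = H @ W2 + b2                            # back to the input size, rows stacked vertically
X_next = X + F                             # residual: input for the next block
print(X_next.shape)                        # (8, 16)
```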
More clarifications here: GPT breakdown
Original paper: Original paper
The result of the last transformer block is fed into the unembedding layer. We take the last row vector and apply the unembedding matrix Wu to it. We divide by the temperature (a higher temperature creates more randomness in the token choice). Softmax squeezes the result into a probability distribution, and we then sample a random token from that distribution.
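A minimal sketch of the unembedding and sampling step, assuming temperature is applied by dividing the logits before the softmax; the sizes and the temperature value are made up.

```python
# A minimal sketch of unembedding + temperature softmax + sampling.
import numpy as np

rng = np.random.default_rng(0)
k, d, temperature = 1000, 16, 0.8

X_last = rng.normal(size=(8, d))           # stand-in for the last transformer output
W_u = rng.normal(size=(d, k))              # unembedding matrix

logits = X_last[-1] @ W_u / temperature    # last row only, divided by temperature
logits -= logits.max()
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probability distribution
next_token = rng.choice(k, p=probs)        # sample a token from the distribution
print(next_token)
```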
See BPE.
Beta 1, beta 2, and epsilon are constants which are set by you (standard values are 0.9, 0.999, and 10^-8 respectively). The learning rate "a" is also set by you, and usually you make it decrease as training goes on (unlike the previous three constants). "g" is the gradient computed in the current round of backpropagation. "d" is exactly how much your parameters will change (final parameter = parameter + d) during this round of backpropagation.
To visualize the Adam optimizer, just imagine a ball rolling around a terrain (the cost function). It has momentum, so it needs a lot of incentive to change direction.
Beta 1 tells you how much of the original momentum Mprev you want to keep. The rest, (1 - beta 1), is the new momentum you want to apply to the rolling ball, which is simply your gradient. Likewise, beta 2 does the same, but for V.
Notice how V is essentially a squared term (g Hadamard g; the circle with the dot is the Hadamard product). Think of V as acceleration. The Adam optimizer tries to prevent the ball from accelerating, preferring a fairly constant walk around the terrain.
Now, at the start, the momentums are near zero, since (1 - beta 1) makes g's influence very small. So, to remedy the sluggish start, we divide by (1 - beta 1 ^ t), where the exponent t is the current number of rounds of backpropagation. Since beta 1 < 1, beta 1 ^ t vanishes as training goes on, leaving a division by 1. It is then safe to drop M hat and simply work with M directly. The same goes for V and V hat, where we divide by (1 - beta 2 ^ t).
Now, let's look at d. We could have simply written d = -a * Mhat, where a is the current learning rate. (Notice that Mhat is related to the gradient, so you negate it.) The reason we divide by the square root of Vhat is that we want to prevent acceleration; taking the square root puts it on the same order of magnitude as M. Finally, epsilon is a small term that prevents division by zero (and explosion) when Vhat is zero or very small.
As you can see, you need to decrease the learning rate "a" while training: since Mhat / sqrt(Vhat) stays roughly on the order of 1, each step d stays roughly on the order of a, so "a" has to shrink for the parameters to settle.
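A minimal sketch of the Adam update described above, using the standard constants; the gradients here are stand-ins rather than real backpropagation output.

```python
# A minimal sketch of Adam: momentum M, "acceleration" V, bias correction,
# and the parameter step d = -a * Mhat / (sqrt(Vhat) + eps).
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
a = 1e-3                                   # learning rate, decayed over training
theta = np.zeros(5)                        # parameters
M = np.zeros_like(theta)                   # momentum
V = np.zeros_like(theta)                   # squared-gradient ("acceleration") term

for t in range(1, 101):                    # t = round of backpropagation
    g = np.random.default_rng(t).normal(size=theta.shape)   # stand-in gradient
    M = beta1 * M + (1 - beta1) * g
    V = beta2 * V + (1 - beta2) * g * g    # g Hadamard g
    M_hat = M / (1 - beta1 ** t)           # bias correction for the sluggish start
    V_hat = V / (1 - beta2 ** t)
    d = -a * M_hat / (np.sqrt(V_hat) + eps)
    theta = theta + d                      # final parameter = parameter + d
print(theta)
```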
Layer normalization is applied to each row vector (each word/token) individually. X is the vector, u (μ) is the mean of the vector's components, o^2 (σ²) is their variance, and epsilon is there to prevent division by zero. Gamma and beta are learnable parameters.
Description: We center the vector back around zero. We then "normalize" its components by dividing by the standard deviation (the square root of the variance; a large variance means the components are too spread out). After that, it's basically a line equation y = mx + b, where m is gamma. Gamma can be thought of as the scale (length) of the vector, and beta can be thought of as the bias.
Typically, we apply layer normalization to the output of each transformer block before feeding it into the next one.
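A minimal sketch of layer normalization applied row by row, with learnable gamma and beta; the sizes are made up.

```python
# A minimal sketch of layer normalization: each row (token vector) of X
# is centered, scaled by its standard deviation, then passed through y = mx + b.
import numpy as np

rng = np.random.default_rng(0)
C, d, eps = 8, 16, 1e-5

X = rng.normal(size=(C, d))
gamma, beta = np.ones(d), np.zeros(d)      # learnable parameters

mu = X.mean(axis=1, keepdims=True)         # mean of each row's components
var = X.var(axis=1, keepdims=True)         # variance of each row
X_norm = (X - mu) / np.sqrt(var + eps)     # center, divide by standard deviation
Y = gamma * X_norm + beta                  # "y = mx + b" per component
print(Y.shape)                             # (8, 16)
```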