Output


A step-by-step explanation of what happens inside the model:

1) The user inputs some text, which is converted to lowercase (the model was trained on lowercase text only, hence the all-lowercase output).

2) The text is tokenized, i.e., turned into numbers (e.g., "who is caesar?" might become [14, 623, 1237, 124]). This model uses a vocabulary size of 3000, which means it knows only 3000 distinct byte sequences; anything it doesn't know gets mapped to an "unknown" token.

3) Each token gets a learned embedding, a vector of 512 values. The embedding is how the model knows what a token "means", so near-synonyms like "fight" and "battle" end up with similar embedding values.

4) Each token also receives a positional embedding, another 512-dimensional vector determined by its position in the sequence. (A more advanced model would use relative position embeddings rather than learned absolute ones.) Side note: why do we need to tell the model each token's position? Understanding this fully requires understanding the attention mechanism, but in short, attention is order-agnostic: without positional information the model has no way to know the ordering of the tokens. Read "Attention Is All You Need" for more on this.

5) The embedded representations of all the input tokens then pass through 6 transformer layers. Each layer is a multi-head attention block followed by linear layers that expand the dimensionality and then collapse it back, so the dimensionality stays constant between layers.

6) After passing through all the transformer layers, the transformed input goes to a final linear layer that predicts the next token.

This is simplified, and ignores all the activation functions, layer norms, and skip connections, but it covers the important aspects. RomeGPT is very basic and small (22.3M parameters) compared to Claude or ChatGPT, whose parameter counts are not public but likely number in the trillions. This isn't established fact, but you probably need more than 1B parameters to make a decent chatbot.
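The steps above can be sketched in a few lines of NumPy. This is an illustrative toy, not RomeGPT's actual code: the weights are random (untrained), single-head attention stands in for multi-head, and the layer norms, residual connections, and real tokenizer are omitted, just as in the description. The token ids are the example ones from step 2.

```python
import numpy as np

# Sizes mirroring the description: vocab 3000, d_model 512, 6 layers.
rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_LAYERS = 3000, 512, 6

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1-2) lowercased text, tokenized into ids (illustrative, not a real tokenizer)
tokens = np.array([14, 623, 1237, 124])   # "who is caesar?"
n = len(tokens)

# 3) learned token embeddings and 4) learned positional embeddings, both 512-d
tok_emb = rng.normal(size=(VOCAB, D_MODEL)) * 0.02
pos_emb = rng.normal(size=(n, D_MODEL)) * 0.02
x = tok_emb[tokens] + pos_emb                      # shape (n, D_MODEL)

# 5) six layers: self-attention + an expand-then-collapse feed-forward block
#    (single head here for brevity; norms and skip connections omitted)
for _ in range(N_LAYERS):
    Wq, Wk, Wv = (rng.normal(size=(D_MODEL, D_MODEL)) * 0.02 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D_MODEL)
    # causal mask: each token may only attend to itself and earlier tokens
    scores += np.triu(np.full((n, n), -1e9), k=1)
    x = softmax(scores) @ v
    # feed-forward: expand to 4*d_model, ReLU, collapse back to d_model
    W1 = rng.normal(size=(D_MODEL, 4 * D_MODEL)) * 0.02
    W2 = rng.normal(size=(4 * D_MODEL, D_MODEL)) * 0.02
    x = np.maximum(x @ W1, 0) @ W2                 # dimensionality unchanged

# 6) final linear layer projects back to vocabulary-sized logits
W_out = rng.normal(size=(D_MODEL, VOCAB)) * 0.02
logits = x @ W_out                                 # shape (n, VOCAB)
next_token = int(np.argmax(logits[-1]))            # predicted next token id
print(logits.shape)                                # (4, 3000)
```

With trained weights, the last row of `logits` would (after a softmax) give a probability for each of the 3000 tokens, and sampling from it repeatedly is what generates text.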
Training one of these from scratch on consumer GPUs (which is what I used for this model; it took over an hour just to train this version!) isn't the most practical. But I wanted to try it, and so RomeGPT was made.